openclaw-windows-node/pr120_full.diff
Sytone 1d836390e9
feat: SSH tunnel gateway, device identity, reconnect hardening
Adds SSH local port-forward support for secure remote gateway access, Ed25519 device identity for operator auth, enhanced Quick Send with error remediation, reconnect resilience, and OpenClaw.Cli validator tool.

Includes security fix: SSH user/host input validation to prevent command injection.

615 tests pass (516 shared + 99 tray).

Contributed by @sytone
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-01 00:03:43 -07:00


From be624fe4528580ad8ec89e9b92a16a9e16fc408e Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Sat, 21 Mar 2026 16:32:48 +0000
Subject: [PATCH 01/83] Add Windows voice mode foundation and AlwaysOn runtime
---
docs/VOICE-MODE.md | 371 ++++++
.../Capabilities/VoiceCapability.cs | 174 +++
src/OpenClaw.Shared/SettingsData.cs | 1 +
src/OpenClaw.Shared/VoiceModeSchema.cs | 144 +++
src/OpenClaw.Tray.WinUI/App.xaml.cs | 56 +-
.../Services/NodeService.cs | 82 +-
.../Services/SettingsManager.cs | 5 +-
.../Services/VoiceProviderCatalogService.cs | 155 +++
.../Services/VoiceService.cs | 1040 +++++++++++++++++
.../Windows/VoiceModeWindow.xaml | 92 ++
.../Windows/VoiceModeWindow.xaml.cs | 330 ++++++
.../OpenClaw.Shared.Tests/CapabilityTests.cs | 202 ++++
.../VoiceModeSchemaTests.cs | 77 ++
.../SettingsRoundTripTests.cs | 52 +
14 files changed, 2777 insertions(+), 4 deletions(-)
create mode 100644 docs/VOICE-MODE.md
create mode 100644 src/OpenClaw.Shared/Capabilities/VoiceCapability.cs
create mode 100644 src/OpenClaw.Shared/VoiceModeSchema.cs
create mode 100644 src/OpenClaw.Tray.WinUI/Services/VoiceProviderCatalogService.cs
create mode 100644 src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
create mode 100644 src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
create mode 100644 src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
create mode 100644 tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
new file mode 100644
index 0000000..664be7f
--- /dev/null
+++ b/docs/VOICE-MODE.md
@@ -0,0 +1,371 @@
+# Voice Mode Architecture
+
+This document defines the voice subsystem for the Windows node only. It introduces the command surface, persisted settings schema, and minimum runtime boundaries needed to add Windows voice support without reshaping the existing node architecture.
+
+## Goals
+
+- Add a node-local voice mode with two activation modes: `wakeword` and `alwaysOn`
+- Use NanoWakeWord for wakeword detection on-device
+- Provide parity targets with the macOS app:
+ - `WakeWord` maps to Voice Wake
+ - `AlwaysOn` maps to Talk Mode
+- Keep STT/TTS provider selection configurable, with Windows implementations as the default built-ins
+- Keep provider-specific STT/TTS concerns separate from the Windows node by default
+- Reuse the existing node capability pattern instead of introducing a parallel control path
+
+## Non-Goals
+
+- True full-duplex or chunk-streaming audio transport between node and gateway
+- Provider-specific STT/TTS routing in the Windows node
+- Changes to unrelated project documentation
+
+## Design Position
+
+The Windows node should own device-local audio concerns:
+
+- microphone capture
+- wakeword detection
+- silence detection / utterance segmentation
+- speaker playback
+- device enumeration and persisted local settings
+
+OpenClaw remains responsible for conversation/session routing and upstream voice orchestration.
+
+This keeps the Windows node lean for the first implementation and avoids introducing provider-routing settings before they are needed.
+
+## macOS Parity Mapping
+
+Windows voice mode aims for functional parity with the existing macOS voice surfaces:
+
+| Windows Mode | macOS Equivalent | Behavior |
+|---|---|---|
+| `WakeWord` | Voice Wake | passively listen for a trigger phrase, capture one utterance, then submit after end silence |
+| `AlwaysOn` | Talk Mode | continuous listen -> think -> speak loop with barge-in support, while still remaining turn-based rather than true simultaneous duplex audio |
+
+For v1 on Windows, `AlwaysOn` is Talk Mode parity, not a new full-duplex transport.
+The current implementation is still turn-based: listen, send transcript, wait, speak, resume listening.
+
+## Transport Boundary
+
+For macOS parity, `AlwaysOn` should follow Talk Mode's documented control flow:
+
+- the node captures audio locally
+- local speech recognition turns that audio into transcript text
+- the transcript is sent to OpenClaw via `chat.send` on the main session
+- OpenClaw returns the assistant reply as normal chat output
+- the node performs local TTS playback of that reply
+
+That means the first Windows parity target is transcript transport, not raw audio upload. Streaming audio frames in or out of OpenClaw remains a future protocol extension, not part of this design.
+
+The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection exists only to carry `chat.send` and assistant chat events for `AlwaysOn`.
+
+## Provider Selection
+
+Voice settings now carry explicit provider ids for both STT and TTS:
+
+- `Voice.SpeechToTextProviderId`
+- `Voice.TextToSpeechProviderId`
+
+The built-in default for both is `windows`.
+
+Runtime behavior in the current phase:
+
+- `windows` is implemented for both STT and TTS
+- non-Windows providers can be selected and persisted now
+- unsupported providers fall back to Windows at runtime with a status warning
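+
+The fallback rule above can be sketched as follows. This is a minimal illustration, not code from this patch: `VoiceProviderResolver`, its `Resolve` method, and the warning text are hypothetical names chosen for the example.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Usage: a selected but unimplemented provider resolves to "windows" plus a warning.
+var (providerId, warning) = VoiceProviderResolver.Resolve("elevenlabs");
+Console.WriteLine($"{providerId}: {warning ?? "no warning"}");
+
+// Hypothetical helper (assumption): not part of the patch.
+static class VoiceProviderResolver
+{
+    public const string Windows = "windows";
+
+    // Providers that have a runtime implementation in the current phase.
+    private static readonly HashSet<string> s_implemented =
+        new(StringComparer.OrdinalIgnoreCase) { Windows };
+
+    // Returns the provider id to use at runtime and an optional status warning.
+    public static (string ProviderId, string? Warning) Resolve(string? requestedId)
+    {
+        if (string.IsNullOrWhiteSpace(requestedId) || s_implemented.Contains(requestedId))
+            return (Windows, null);
+
+        return (Windows, $"Provider '{requestedId}' is not implemented yet; falling back to '{Windows}'.");
+    }
+}
+```
+
+The persisted setting keeps the requested id, so a later release that implements the provider can honor it without another settings change.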
+
+### Local Provider Catalog
+
+Additional provider entries are supplied through a local catalog file:
+
+- `%APPDATA%\OpenClawTray\voice-providers.json`
+
+Example:
+
+```json
+{
+ "speechToTextProviders": [
+ {
+ "id": "windows",
+ "name": "Windows Speech Recognition",
+ "runtime": "windows",
+ "enabled": true,
+ "description": "Built-in Windows dictation and speech recognition."
+ },
+ {
+ "id": "minimax",
+ "name": "MiniMax Speech To Text",
+ "runtime": "gateway",
+ "enabled": true,
+ "description": "Planned future provider."
+ }
+ ],
+ "textToSpeechProviders": [
+ {
+ "id": "windows",
+ "name": "Windows Speech Synthesis",
+ "runtime": "windows",
+ "enabled": true,
+ "description": "Built-in Windows text-to-speech playback."
+ },
+ {
+ "id": "elevenlabs",
+ "name": "ElevenLabs",
+ "runtime": "gateway",
+ "enabled": true,
+ "description": "Planned future provider."
+ }
+ ]
+}
+```
+
+This file only defines selectable providers. It does not carry API keys.
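+
+Reading the catalog could look roughly like the sketch below. It assumes the file lives at the path given above and that a missing file means an empty catalog; `Catalog` and `ProviderOption` are local stand-ins that mirror the JSON shape, not the types shipped in the patch.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.IO;
+using System.Text.Json;
+
+// %APPDATA% corresponds to SpecialFolder.ApplicationData (roaming profile).
+var path = Path.Combine(
+    Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData),
+    "OpenClawTray",
+    "voice-providers.json");
+
+// The catalog uses camelCase keys, so match property names case-insensitively.
+var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
+
+// Assumption: a missing file is treated as an empty catalog.
+// Malformed JSON still throws JsonException and should be handled by the caller.
+var catalog = File.Exists(path)
+    ? JsonSerializer.Deserialize<Catalog>(File.ReadAllText(path), options) ?? new Catalog()
+    : new Catalog();
+
+foreach (var provider in catalog.SpeechToTextProviders)
+    if (provider.Enabled) Console.WriteLine($"STT: {provider.Id} ({provider.Runtime})");
+
+foreach (var provider in catalog.TextToSpeechProviders)
+    if (provider.Enabled) Console.WriteLine($"TTS: {provider.Id} ({provider.Runtime})");
+
+// Local stand-ins for the catalog entries (assumption: illustration only).
+sealed class ProviderOption
+{
+    public string Id { get; set; } = "";
+    public string Name { get; set; } = "";
+    public string Runtime { get; set; } = "windows";
+    public bool Enabled { get; set; } = true;
+    public string? Description { get; set; }
+}
+
+sealed class Catalog
+{
+    public List<ProviderOption> SpeechToTextProviders { get; set; } = new();
+    public List<ProviderOption> TextToSpeechProviders { get; set; } = new();
+}
+```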
+
+### OpenClaw Configuration Discovery
+
+It may be technically possible to inspect parts of the OpenClaw configuration surface to infer preferred providers. However, the documented config protocol notes that sensitive fields have no redaction layer, so automatically pulling provider credentials into the Windows tray is not a safe default.
+
+Because of that, this design keeps provider selection local for now:
+
+- local tray settings choose the preferred STT/TTS provider ids
+- OpenClaw remains the conversation endpoint for `chat.send`
+- future provider adapters can decide whether they use local credentials, gateway-owned credentials, or both
+
+For `WakeWord`, trigger words are gateway-owned global state. The Windows node should eventually consume the same shared trigger list and keep only a local enabled/disabled toggle plus device/runtime settings.
+
+## Command Surface
+
+The voice subsystem is introduced as a new node capability category: `voice`.
+
+### Commands
+
+| Command | Purpose | Request Payload | Response Payload |
+|---|---|---|---|
+| `voice.devices.list` | Enumerate input/output audio devices | none | `VoiceAudioDeviceInfo[]` |
+| `voice.settings.get` | Return the effective voice configuration | none | `VoiceSettings` |
+| `voice.settings.set` | Update the voice configuration | `VoiceSettingsUpdateArgs` | `VoiceSettings` |
+| `voice.status.get` | Return runtime voice status | none | `VoiceStatusInfo` |
+| `voice.start` | Start the voice runtime with the supplied or persisted mode | `VoiceStartArgs` | `VoiceStatusInfo` |
+| `voice.stop` | Stop the voice runtime | `VoiceStopArgs` | `VoiceStatusInfo` |
+
+### Payload Types
+
+- `VoiceSettings`
+- `VoiceWakeWordSettings`
+- `VoiceAlwaysOnSettings`
+- `VoiceAudioDeviceInfo`
+- `VoiceStatusInfo`
+- `VoiceStartArgs`
+- `VoiceStopArgs`
+- `VoiceSettingsUpdateArgs`
+
+These contracts are defined in [VoiceModeSchema.cs](../src/OpenClaw.Shared/VoiceModeSchema.cs).
+
+## Settings Schema
+
+Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](../src/OpenClaw.Shared/SettingsData.cs).
+
+### Effective Schema
+
+```json
+{
+ "Voice": {
+ "Mode": "WakeWord",
+ "Enabled": true,
+ "SpeechToTextProviderId": "windows",
+ "TextToSpeechProviderId": "windows",
+ "InputDeviceId": "default-mic",
+ "OutputDeviceId": "default-speaker",
+ "SampleRateHz": 16000,
+ "CaptureChunkMs": 80,
+ "BargeInEnabled": true,
+ "WakeWord": {
+ "Engine": "NanoWakeWord",
+ "ModelId": "hey_openclaw",
+ "TriggerThreshold": 0.65,
+ "TriggerCooldownMs": 2000,
+ "PreRollMs": 1200,
+ "EndSilenceMs": 900
+ },
+ "AlwaysOn": {
+ "MinSpeechMs": 250,
+ "EndSilenceMs": 900,
+ "MaxUtteranceMs": 15000,
+ "AutoSubmit": true
+ }
+ }
+}
+```
+
+### Field Rationale
+
+| Field | Purpose |
+|---|---|
+| `Mode` | Top-level activation mode: `Off`, `WakeWord`, `AlwaysOn` |
+| `Enabled` | Global feature kill-switch independent of mode |
+| `SpeechToTextProviderId` | Selected STT provider id from the local provider catalog |
+| `TextToSpeechProviderId` | Selected TTS provider id from the local provider catalog |
+| `InputDeviceId` / `OutputDeviceId` | Stable audio device binding |
+| `SampleRateHz` | Shared capture sample rate, fixed to a speech-friendly default |
+| `CaptureChunkMs` | Frame size for capture, VAD, and wakeword processing |
+| `BargeInEnabled` | Allows microphone capture while audio playback is active |
+| `WakeWord.*` | NanoWakeWord and post-trigger utterance capture tuning |
+| `AlwaysOn.*` | Continuous-listening segmentation tuning |
+
+### Complete Settings Definition
+
+| Setting | Type | Default | Applies To | Meaning |
+|---|---|---|---|---|
+| `Voice.Mode` | enum | `Off` | all | Activation mode: `Off`, `WakeWord`, `AlwaysOn` |
+| `Voice.Enabled` | bool | `false` | all | Master enable/disable flag for voice mode |
+| `Voice.SpeechToTextProviderId` | string | `windows` | all | Preferred speech-to-text provider id |
+| `Voice.TextToSpeechProviderId` | string | `windows` | all | Preferred text-to-speech provider id |
+| `Voice.InputDeviceId` | string? | `null` | all | Preferred microphone device id; `null` means system default |
+| `Voice.OutputDeviceId` | string? | `null` | all | Preferred speaker device id; `null` means system default |
+| `Voice.SampleRateHz` | int | `16000` | all | Internal capture rate used for wakeword, VAD, and utterance assembly |
+| `Voice.CaptureChunkMs` | int | `80` | all | Audio frame duration used by the capture loop |
+| `Voice.BargeInEnabled` | bool | `true` | all | If `true`, microphone capture may continue while response audio is playing |
+| `Voice.WakeWord.Engine` | string | `NanoWakeWord` | wakeword | Wakeword engine identifier |
+| `Voice.WakeWord.ModelId` | string | `hey_openclaw` | wakeword | Wakeword model/profile identifier |
+| `Voice.WakeWord.TriggerThreshold` | float | `0.65` | wakeword | Minimum score required to trigger wakeword activation |
+| `Voice.WakeWord.TriggerCooldownMs` | int | `2000` | wakeword | Minimum delay before another wakeword trigger is accepted |
+| `Voice.WakeWord.PreRollMs` | int | `1200` | wakeword | Buffered audio retained before the trigger point |
+| `Voice.WakeWord.EndSilenceMs` | int | `900` | wakeword | Silence timeout used to finalize the post-trigger utterance |
+| `Voice.AlwaysOn.MinSpeechMs` | int | `250` | always-on | Minimum detected speech duration before an utterance is treated as real input |
+| `Voice.AlwaysOn.EndSilenceMs` | int | `900` | always-on | Silence timeout used to finalize an utterance |
+| `Voice.AlwaysOn.MaxUtteranceMs` | int | `15000` | always-on | Hard cap on utterance length before forced submission/finalization |
+| `Voice.AlwaysOn.AutoSubmit` | bool | `true` | always-on | If `true`, completed utterances are submitted immediately without extra confirmation |
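+
+As a worked example of how the timing defaults translate into buffer sizes, the sketch below derives chunk and pre-roll sizes from the values in the table. It assumes PCM16 mono capture, as shown in the wakeword data flow, and rounds durations up to whole chunks; the variable names are illustrative.
+
+```csharp
+using System;
+
+const int sampleRateHz = 16000;   // Voice.SampleRateHz
+const int captureChunkMs = 80;    // Voice.CaptureChunkMs
+const int preRollMs = 1200;       // Voice.WakeWord.PreRollMs
+const int endSilenceMs = 900;     // Voice.WakeWord.EndSilenceMs
+
+// Samples in one capture chunk: 16000 * 80 / 1000 = 1280.
+int samplesPerChunk = sampleRateHz * captureChunkMs / 1000;
+
+// PCM16 mono uses two bytes per sample: 1280 * 2 = 2560.
+int bytesPerChunk = samplesPerChunk * sizeof(short);
+
+// Chunks needed to cover a duration, rounded up to a whole chunk.
+int preRollChunks = (preRollMs + captureChunkMs - 1) / captureChunkMs;       // 15
+int endSilenceChunks = (endSilenceMs + captureChunkMs - 1) / captureChunkMs; // 12
+
+Console.WriteLine($"samples/chunk={samplesPerChunk}, bytes/chunk={bytesPerChunk}");
+Console.WriteLine($"pre-roll chunks={preRollChunks}, end-silence chunks={endSilenceChunks}");
+```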
+
+At runtime today, `Voice.InputDeviceId` and `Voice.OutputDeviceId` are persisted and surfaced in the UI, but the v1 `AlwaysOn` path still uses the Windows system speech stack defaults for capture and playback.
+
+## Component Architecture
+
+```mermaid
+flowchart LR
+ A["NodeService<br/>control + lifecycle"] --> B["VoiceCapability<br/>command surface"]
+ B --> C["VoiceCoordinator<br/>runtime state machine"]
+ C --> D["SpeechRecognizer<br/>Windows continuous dictation"]
+ C --> E["WakeWordService<br/>NanoWakeWord scores"]
+ C --> F["VoiceActivityDetector<br/>speech/silence segments"]
+ C --> G["VoiceTransport<br/>operator sidecar + chat.send exchange"]
+ C --> H["SpeechSynthesizer + MediaPlayer<br/>reply playback"]
+ B --> I["SettingsManager / SettingsData.Voice<br/>persisted config JSON"]
+```
+
+## Runtime Data Flow
+
+### Wakeword Mode
+
+```mermaid
+flowchart TD
+ A["Microphone device<br/>float/PCM hardware frames"] --> B["AudioCaptureService<br/>PCM16 mono 16kHz chunks"]
+ B --> C["Ring Buffer<br/>bounded pre-roll PCM16 frames"]
+ B --> D["WakeWordService (NanoWakeWord)<br/>wake score per chunk"]
+ D --> E{"score >= trigger threshold?"}
+ E -- "no" --> B
+ E -- "yes" --> F["VoiceCoordinator<br/>WakeWordDetected(session state change)"]
+ F --> G["UtteranceAssembler<br/>seed with pre-roll PCM16 from Ring Buffer"]
+ C --> G
+ B --> G
+ G --> H["VoiceActivityDetector<br/>speech/silence state from PCM16 chunks"]
+ H --> I{"speech still active?"}
+ I -- "yes" --> B
+ I -- "no, end silence reached" --> J["Finalize utterance<br/>PCM16 buffer + timing metadata"]
+ J --> K["SpeechRecognizer<br/>utterance PCM16 -> transcript text"]
+ K --> L["VoiceTransport<br/>chat.send(main, transcript)"]
+ L --> M["OpenClaw conversation pipeline<br/>assistant reply text"]
+ M --> N["AudioPlaybackService<br/>TTS output bytes / decoded PCM"]
+ N --> O["Speaker device<br/>rendered audio"]
+ O --> P{"barge-in enabled?"}
+ P -- "yes" --> B
+ P -- "no, playback complete" --> B
+```
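+
+The ring buffer step in the diagram above can be sketched as a bounded queue of chunks. This is an illustration under stated assumptions, not the patch's implementation: `PreRollBuffer` is a hypothetical type, and the capacity of 15 chunks corresponds to 1200 ms of pre-roll at 80 ms per chunk.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Simulate 20 captured chunks of 1280 samples (80 ms at 16 kHz); only the newest 15 are kept.
+var preRoll = new PreRollBuffer(capacityChunks: 15);
+for (var i = 0; i < 20; i++)
+    preRoll.Push(new short[1280]);
+
+short[] seed = preRoll.Snapshot();
+Console.WriteLine($"{preRoll.Count} chunks, {seed.Length} samples"); // 15 chunks, 19200 samples
+
+// Hypothetical bounded pre-roll buffer (assumption: illustration only).
+sealed class PreRollBuffer
+{
+    private readonly Queue<short[]> _chunks = new();
+    private readonly int _capacity;
+
+    public PreRollBuffer(int capacityChunks)
+    {
+        if (capacityChunks <= 0)
+            throw new ArgumentOutOfRangeException(nameof(capacityChunks));
+        _capacity = capacityChunks;
+    }
+
+    public int Count => _chunks.Count;
+
+    // Adds the newest chunk and drops the oldest once the buffer is full.
+    public void Push(short[] chunk)
+    {
+        if (_chunks.Count == _capacity)
+            _chunks.Dequeue();
+        _chunks.Enqueue(chunk);
+    }
+
+    // Copies the buffered chunks, oldest first, into one contiguous PCM16 array
+    // that can seed the utterance when the wakeword fires.
+    public short[] Snapshot()
+    {
+        var total = 0;
+        foreach (var chunk in _chunks) total += chunk.Length;
+
+        var result = new short[total];
+        var offset = 0;
+        foreach (var chunk in _chunks)
+        {
+            Array.Copy(chunk, 0, result, offset, chunk.Length);
+            offset += chunk.Length;
+        }
+        return result;
+    }
+}
+```
+
+A queue of whole chunks keeps the sketch simple; a fixed array with a write index would avoid per-chunk allocations if capture latency matters.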
+
+### Always-On Mode
+
+```mermaid
+flowchart TD
+ A["Windows speech input<br/>default microphone path"] --> B["SpeechRecognizer<br/>continuous dictation result text"]
+ B --> C{"final recognized text?"}
+ C -- "no" --> A
+ C -- "yes" --> D["VoiceCoordinator<br/>pause listening and mark AwaitingResponse"]
+ D --> E["VoiceTransport<br/>chat.send(main, transcript)"]
+ E --> F["OpenClaw conversation pipeline<br/>assistant reply text"]
+ F --> G["SpeechSynthesizer<br/>assistant text -> audio stream"]
+ G --> H["MediaPlayer<br/>reply playback"]
+ H --> I["Windows audio output<br/>default speaker path"]
+ I --> J["VoiceCoordinator<br/>resume continuous listening"]
+ J --> A
+```
+
+## Processing Stages and Data Types
+
+| Stage | Component | Input | Output |
+|---|---|---|---|
+| 1 | `SpeechRecognizer` | Windows microphone capture | recognized transcript text |
+| 2a | `WakeWordService` | PCM16 chunk | wake score / trigger decision |
+| 2b | `VoiceActivityDetector` | PCM16 chunk | speech/silence state |
+| 3 | `Ring Buffer` | PCM16 chunk stream | bounded pre-roll PCM16 window |
+| 4 | `UtteranceAssembler` | pre-roll + live PCM16 chunks | utterance PCM16 buffer |
+| 5 | `SpeechRecognizer` | utterance PCM16 + timing metadata | transcript text |
+| 6 | `VoiceTransport` | transcript text + session key | `chat.send` request / assistant reply text |
+| 7 | `SpeechSynthesizer + MediaPlayer` | assistant reply text | speaker render stream |
+
+## Control Flow
+
+```mermaid
+sequenceDiagram
+ participant Gateway as Gateway / Operator
+ participant VoiceCap as VoiceCapability
+ participant Coord as VoiceCoordinator
+ participant Store as SettingsData.Voice
+
+ Gateway->>VoiceCap: voice.settings.get
+ VoiceCap-->>Gateway: VoiceSettings
+
+ Gateway->>VoiceCap: voice.settings.set(settings, persist=true)
+ VoiceCap->>Store: save VoiceSettings
+ VoiceCap-->>Gateway: VoiceSettings
+
+ Gateway->>VoiceCap: voice.start(mode=WakeWord, sessionKey=...)
+ VoiceCap->>Coord: Start(VoiceStartArgs)
+ Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningForWakeWord)
+ VoiceCap-->>Gateway: VoiceStatusInfo
+
+ Gateway->>VoiceCap: voice.status.get
+ VoiceCap-->>Gateway: VoiceStatusInfo
+
+ Gateway->>VoiceCap: voice.stop(reason=...)
+ VoiceCap->>Coord: Stop()
+ Coord-->>VoiceCap: VoiceStatusInfo(state=Stopped)
+ VoiceCap-->>Gateway: VoiceStatusInfo
+```
+
+## Integration Boundaries
+
+### Existing Components Reused
+
+- `NodeService` remains the capability registration and lifecycle owner
+- `SettingsData` remains the persisted JSON settings model
+- `WindowsNodeClient` remains the gateway/node transport
+- existing node capability registration remains the integration pattern
+- current request/response transport remains the v1 control plane
+- `AlwaysOn` parity should reuse existing `chat.send` message flow instead of inventing an audio-upload protocol
+
+### New Components Expected Later
+
+- `VoiceCapability` in `OpenClaw.Shared.Capabilities`
+- `AudioCaptureService` in `OpenClaw.Tray.WinUI.Services`
+- `WakeWordService` in `OpenClaw.Tray.WinUI.Services`
+- `VoiceCoordinator` in `OpenClaw.Tray.WinUI.Services`
+- `AudioPlaybackService` in `OpenClaw.Tray.WinUI.Services`
+
+## Why Provider Support Is Abstracted
+
+MiniMax and ElevenLabs are valid future targets, but binding provider choice into the Windows node now would introduce:
+
+- duplicated provider integration work already handled by OpenClaw
+- local credential management on Windows
+- tighter coupling between node runtime and vendor APIs
+
+For the first implementation, the Windows node should manage local audio behavior, local speech recognition, and local playback while reusing existing OpenClaw message flows for conversation. If provider routing becomes a real requirement later, it can be added back without changing the core activation-mode model.
diff --git a/src/OpenClaw.Shared/Capabilities/VoiceCapability.cs b/src/OpenClaw.Shared/Capabilities/VoiceCapability.cs
new file mode 100644
index 0000000..37d98fa
--- /dev/null
+++ b/src/OpenClaw.Shared/Capabilities/VoiceCapability.cs
@@ -0,0 +1,174 @@
+using System;
+using System.Collections.Generic;
+using System.Text.Json;
+using System.Threading.Tasks;
+
+namespace OpenClaw.Shared.Capabilities;
+
+public class VoiceCapability : NodeCapabilityBase
+{
+ private static readonly JsonSerializerOptions s_jsonOptions = new()
+ {
+ PropertyNameCaseInsensitive = true
+ };
+
+ public override string Category => "voice";
+
+ public override IReadOnlyList<string> Commands => VoiceCommands.All;
+
+ public event Func<Task<VoiceAudioDeviceInfo[]>>? ListDevicesRequested;
+ public event Func<Task<VoiceSettings>>? SettingsRequested;
+ public event Func<VoiceSettingsUpdateArgs, Task<VoiceSettings>>? SettingsUpdateRequested;
+ public event Func<Task<VoiceStatusInfo>>? StatusRequested;
+ public event Func<VoiceStartArgs, Task<VoiceStatusInfo>>? StartRequested;
+ public event Func<VoiceStopArgs, Task<VoiceStatusInfo>>? StopRequested;
+
+ public VoiceCapability(IOpenClawLogger logger) : base(logger)
+ {
+ }
+
+ public override async Task<NodeInvokeResponse> ExecuteAsync(NodeInvokeRequest request)
+ {
+ return request.Command switch
+ {
+ VoiceCommands.ListDevices => await HandleListDevicesAsync(),
+ VoiceCommands.GetSettings => await HandleGetSettingsAsync(),
+ VoiceCommands.SetSettings => await HandleSetSettingsAsync(request),
+ VoiceCommands.GetStatus => await HandleGetStatusAsync(),
+ VoiceCommands.Start => await HandleStartAsync(request),
+ VoiceCommands.Stop => await HandleStopAsync(request),
+ _ => Error($"Unknown command: {request.Command}")
+ };
+ }
+
+ private async Task<NodeInvokeResponse> HandleListDevicesAsync()
+ {
+ Logger.Info(VoiceCommands.ListDevices);
+
+ if (ListDevicesRequested == null)
+ return Error("Voice device enumeration not available");
+
+ try
+ {
+ return Success(await ListDevicesRequested());
+ }
+ catch (Exception ex)
+ {
+ Logger.Error("Voice device enumeration failed", ex);
+ return Error($"Device enumeration failed: {ex.Message}");
+ }
+ }
+
+ private async Task<NodeInvokeResponse> HandleGetSettingsAsync()
+ {
+ Logger.Info(VoiceCommands.GetSettings);
+
+ if (SettingsRequested == null)
+ return Error("Voice settings not available");
+
+ try
+ {
+ return Success(await SettingsRequested());
+ }
+ catch (Exception ex)
+ {
+ Logger.Error("Voice settings get failed", ex);
+ return Error($"Get settings failed: {ex.Message}");
+ }
+ }
+
+ private async Task<NodeInvokeResponse> HandleSetSettingsAsync(NodeInvokeRequest request)
+ {
+ Logger.Info(VoiceCommands.SetSettings);
+
+ if (SettingsUpdateRequested == null)
+ return Error("Voice settings update not available");
+
+ try
+ {
+ var rawArgs = request.Args.ValueKind is JsonValueKind.Undefined or JsonValueKind.Null
+ ? "{}"
+ : request.Args.GetRawText();
+ VoiceSettingsUpdateArgs? update = null;
+ if (request.Args.ValueKind == JsonValueKind.Object &&
+ request.Args.TryGetProperty("update", out var updateEl))
+ {
+ update = JsonSerializer.Deserialize<VoiceSettingsUpdateArgs>(updateEl.GetRawText(), s_jsonOptions);
+ }
+
+ update ??= JsonSerializer.Deserialize<VoiceSettingsUpdateArgs>(rawArgs, s_jsonOptions);
+
+ if (update == null)
+ return Error("Missing update payload");
+
+ return Success(await SettingsUpdateRequested(update));
+ }
+ catch (Exception ex)
+ {
+ Logger.Error("Voice settings update failed", ex);
+ return Error($"Set settings failed: {ex.Message}");
+ }
+ }
+
+ private async Task<NodeInvokeResponse> HandleGetStatusAsync()
+ {
+ Logger.Info(VoiceCommands.GetStatus);
+
+ if (StatusRequested == null)
+ return Error("Voice status not available");
+
+ try
+ {
+ return Success(await StatusRequested());
+ }
+ catch (Exception ex)
+ {
+ Logger.Error("Voice status get failed", ex);
+ return Error($"Get status failed: {ex.Message}");
+ }
+ }
+
+ private async Task<NodeInvokeResponse> HandleStartAsync(NodeInvokeRequest request)
+ {
+ Logger.Info(VoiceCommands.Start);
+
+ if (StartRequested == null)
+ return Error("Voice start not available");
+
+ try
+ {
+ var rawArgs = request.Args.ValueKind is JsonValueKind.Undefined or JsonValueKind.Null
+ ? "{}"
+ : request.Args.GetRawText();
+ var args = JsonSerializer.Deserialize<VoiceStartArgs>(rawArgs, s_jsonOptions) ?? new VoiceStartArgs();
+ return Success(await StartRequested(args));
+ }
+ catch (Exception ex)
+ {
+ Logger.Error("Voice start failed", ex);
+ return Error($"Start failed: {ex.Message}");
+ }
+ }
+
+ private async Task<NodeInvokeResponse> HandleStopAsync(NodeInvokeRequest request)
+ {
+ Logger.Info(VoiceCommands.Stop);
+
+ if (StopRequested == null)
+ return Error("Voice stop not available");
+
+ try
+ {
+ var rawArgs = request.Args.ValueKind is JsonValueKind.Undefined or JsonValueKind.Null
+ ? "{}"
+ : request.Args.GetRawText();
+ var args = JsonSerializer.Deserialize<VoiceStopArgs>(rawArgs, s_jsonOptions) ?? new VoiceStopArgs();
+ return Success(await StopRequested(args));
+ }
+ catch (Exception ex)
+ {
+ Logger.Error("Voice stop failed", ex);
+ return Error($"Stop failed: {ex.Message}");
+ }
+ }
+}
diff --git a/src/OpenClaw.Shared/SettingsData.cs b/src/OpenClaw.Shared/SettingsData.cs
index 4c7b075..c7af724 100644
--- a/src/OpenClaw.Shared/SettingsData.cs
+++ b/src/OpenClaw.Shared/SettingsData.cs
@@ -26,6 +26,7 @@ public class SettingsData
public bool NotifyChatResponses { get; set; } = true;
public bool PreferStructuredCategories { get; set; } = true;
public List<UserNotificationRule>? UserRules { get; set; }
+ public VoiceSettings Voice { get; set; } = new();
private static readonly JsonSerializerOptions s_options = new() { WriteIndented = true };
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
new file mode 100644
index 0000000..0dce2a5
--- /dev/null
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -0,0 +1,144 @@
+using System.Collections.ObjectModel;
+using System.Text.Json.Serialization;
+
+namespace OpenClaw.Shared;
+
+public static class VoiceCommands
+{
+ public const string ListDevices = "voice.devices.list";
+ public const string GetSettings = "voice.settings.get";
+ public const string SetSettings = "voice.settings.set";
+ public const string GetStatus = "voice.status.get";
+ public const string Start = "voice.start";
+ public const string Stop = "voice.stop";
+
+ private static readonly ReadOnlyCollection<string> s_all = Array.AsReadOnly(
+ [
+ ListDevices,
+ GetSettings,
+ SetSettings,
+ GetStatus,
+ Start,
+ Stop
+ ]);
+
+ public static IReadOnlyList<string> All => s_all;
+}
+
+[JsonConverter(typeof(JsonStringEnumConverter<VoiceActivationMode>))]
+public enum VoiceActivationMode
+{
+ Off,
+ WakeWord,
+ AlwaysOn
+}
+
+[JsonConverter(typeof(JsonStringEnumConverter<VoiceRuntimeState>))]
+public enum VoiceRuntimeState
+{
+ Stopped,
+ Idle,
+ Arming,
+ ListeningForWakeWord,
+ ListeningContinuously,
+ RecordingUtterance,
+ SubmittingAudio,
+ AwaitingResponse,
+ PlayingResponse,
+ Error
+}
+
+public sealed class VoiceSettings
+{
+ public VoiceActivationMode Mode { get; set; } = VoiceActivationMode.Off;
+ public bool Enabled { get; set; }
+ public string SpeechToTextProviderId { get; set; } = VoiceProviderIds.Windows;
+ public string TextToSpeechProviderId { get; set; } = VoiceProviderIds.Windows;
+ public string? InputDeviceId { get; set; }
+ public string? OutputDeviceId { get; set; }
+ public int SampleRateHz { get; set; } = 16000;
+ public int CaptureChunkMs { get; set; } = 80;
+ public bool BargeInEnabled { get; set; } = true;
+ public VoiceWakeWordSettings WakeWord { get; set; } = new();
+ public VoiceAlwaysOnSettings AlwaysOn { get; set; } = new();
+}
+
+public sealed class VoiceWakeWordSettings
+{
+ public string Engine { get; set; } = "NanoWakeWord";
+ public string ModelId { get; set; } = "hey_openclaw";
+ public float TriggerThreshold { get; set; } = 0.65f;
+ public int TriggerCooldownMs { get; set; } = 2000;
+ public int PreRollMs { get; set; } = 1200;
+ public int EndSilenceMs { get; set; } = 900;
+}
+
+public sealed class VoiceAlwaysOnSettings
+{
+ public int MinSpeechMs { get; set; } = 250;
+ public int EndSilenceMs { get; set; } = 900;
+ public int MaxUtteranceMs { get; set; } = 15000;
+ public bool AutoSubmit { get; set; } = true;
+}
+
+public sealed class VoiceAudioDeviceInfo
+{
+ public string DeviceId { get; set; } = "";
+ public string Name { get; set; } = "";
+ public bool IsDefault { get; set; }
+ public bool IsInput { get; set; }
+ public bool IsOutput { get; set; }
+}
+
+public sealed class VoiceStatusInfo
+{
+ public bool Available { get; set; }
+ public bool Running { get; set; }
+ public VoiceActivationMode Mode { get; set; } = VoiceActivationMode.Off;
+ public VoiceRuntimeState State { get; set; } = VoiceRuntimeState.Stopped;
+ public string? SessionKey { get; set; }
+ public string? InputDeviceId { get; set; }
+ public string? OutputDeviceId { get; set; }
+ public string? WakeWordModelId { get; set; }
+ public bool WakeWordLoaded { get; set; }
+ public DateTime? LastWakeWordUtc { get; set; }
+ public DateTime? LastUtteranceUtc { get; set; }
+ public string? LastError { get; set; }
+}
+
+public sealed class VoiceStartArgs
+{
+ public VoiceActivationMode? Mode { get; set; }
+ public string? SessionKey { get; set; }
+}
+
+public sealed class VoiceStopArgs
+{
+ public string? Reason { get; set; }
+}
+
+public sealed class VoiceSettingsUpdateArgs
+{
+ public VoiceSettings Settings { get; set; } = new();
+ public bool Persist { get; set; } = true;
+}
+
+public static class VoiceProviderIds
+{
+ public const string Windows = "windows";
+}
+
+public sealed class VoiceProviderOption
+{
+ public string Id { get; set; } = "";
+ public string Name { get; set; } = "";
+ public string Runtime { get; set; } = "windows";
+ public bool Enabled { get; set; } = true;
+ public string? Description { get; set; }
+}
+
+public sealed class VoiceProviderCatalog
+{
+ public List<VoiceProviderOption> SpeechToTextProviders { get; set; } = [];
+ public List<VoiceProviderOption> TextToSpeechProviders { get; set; } = [];
+}
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index de0780f..37552b9 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -63,6 +63,7 @@ public partial class App : Application
// Windows (created on demand)
private SettingsWindow? _settingsWindow;
+ private VoiceModeWindow? _voiceModeWindow;
private WebChatWindow? _webChatWindow;
private StatusDetailWindow? _statusDetailWindow;
private NotificationHistoryWindow? _notificationHistoryWindow;
@@ -72,6 +73,7 @@ public partial class App : Application
// Node service (optional, enabled in settings)
private NodeService? _nodeService;
+ private VoiceService? _voiceService;
// Keep-alive window to anchor WinUI runtime (prevents GC/threading issues)
private Window? _keepAliveWindow;
@@ -250,6 +252,7 @@ protected override async void OnLaunched(LaunchActivatedEventArgs args)
// Initialize settings
_settings = new SettingsManager();
+ _voiceService = new VoiceService(new AppLogger(), _settings);
// First-run check
if (string.IsNullOrWhiteSpace(_settings.Token))
@@ -514,6 +517,7 @@ private void OnTrayMenuItemClicked(object? sender, string action)
switch (action)
{
case "status": ShowStatusDetail(); break;
+ case "voice-settings": ShowVoiceModeSettings(); break;
case "dashboard": OpenDashboard(); break;
case "webchat": ShowWebChat(); break;
case "quicksend": ShowQuickSend(); break;
@@ -725,6 +729,33 @@ private List<string> GetRecentActivity(int maxItems)
.ToList();
}
+ private string GetRunningVoiceModeLabel()
+ {
+ var status = _voiceService?.CurrentStatus;
+ if (status?.Running == true)
+ {
+ return status.Mode switch
+ {
+ VoiceActivationMode.WakeWord => "WakeWord",
+ VoiceActivationMode.AlwaysOn => "AlwaysOn",
+ _ => "Off"
+ };
+ }
+
+ return "Off";
+ }
+
+ private string GetVoiceDeviceSummary()
+ {
+ var voice = _settings?.Voice;
+ if (voice == null)
+ return "Talk: system default · Listen: system default";
+
+ var talk = string.IsNullOrWhiteSpace(voice.OutputDeviceId) ? "system default" : "selected speaker";
+ var listen = string.IsNullOrWhiteSpace(voice.InputDeviceId) ? "system default" : "selected microphone";
+ return $"Talk: {talk} · Listen: {listen}";
+ }
+
private void BuildTrayMenuPopup(TrayMenuWindow menu)
{
// Brand header
@@ -741,6 +772,13 @@ private void BuildTrayMenuPopup(TrayMenuWindow menu)
menu.AddMenuItem(_currentActivity.DisplayText, _currentActivity.Glyph, "", isEnabled: false);
}
+ menu.AddMenuItem($"Voice Mode: {GetRunningVoiceModeLabel()}", "🎙️", "voice-settings");
+ menu.AddMenuItem($"↳ {GetVoiceDeviceSummary()}", "", "", isEnabled: false, indent: true);
+ if (_settings?.EnableNodeMode != true)
+ {
+ menu.AddMenuItem("↳ Enable Node Mode to activate voice runtime", "", "", isEnabled: false, indent: true);
+ }
+
// Usage
if (_lastUsage != null || _lastUsageStatus != null || _lastUsageCost != null)
{
@@ -1126,7 +1164,7 @@ private void InitializeNodeService()
{
Logger.Info("Initializing Windows Node service...");
- _nodeService = new NodeService(new AppLogger(), _dispatcherQueue, DataPath);
+ _nodeService = new NodeService(new AppLogger(), _dispatcherQueue, _voiceService!, DataPath);
_nodeService.StatusChanged += OnNodeStatusChanged;
_nodeService.NotificationRequested += OnNodeNotificationRequested;
_nodeService.PairingStatusChanged += OnPairingStatusChanged;
@@ -1601,6 +1639,20 @@ private void ShowSettings()
_settingsWindow.Activate();
}
+ private void ShowVoiceModeSettings()
+ {
+ if (_settings == null || _voiceService == null)
+ return;
+
+ if (_voiceModeWindow == null || _voiceModeWindow.IsClosed)
+ {
+ _voiceModeWindow = new VoiceModeWindow(_settings, _voiceService);
+ _voiceModeWindow.Closed += (s, e) => _voiceModeWindow = null;
+ }
+
+ _voiceModeWindow.Activate();
+ }
+
private void OnSettingsSaved(object? sender, EventArgs e)
{
// Reconnect with new settings - mirror the startup if/else pattern
@@ -1617,6 +1669,7 @@ private void OnSettingsSaved(object? sender, EventArgs e)
else
{
InitializeGatewayClient();
+ _ = _voiceService?.StopAsync(new VoiceStopArgs { Reason = "Node mode disabled" });
}
// Update global hotkey
@@ -2070,6 +2123,7 @@ private void ExitApplication()
// Dispose cancellation token source
_deepLinkCts?.Dispose();
+ _voiceService?.Dispose();
Exit();
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/NodeService.cs b/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
index 1bd3883..62ea080 100644
--- a/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
@@ -21,6 +21,7 @@ public class NodeService : IDisposable
private CanvasWindow? _canvasWindow;
private ScreenCaptureService? _screenCaptureService;
private CameraCaptureService? _cameraCaptureService;
+ private VoiceService? _voiceService;
private DateTime _lastScreenCaptureNotification = DateTime.MinValue;
private string? _a2uiHostUrl;
@@ -29,6 +30,7 @@ public class NodeService : IDisposable
private CanvasCapability? _canvasCapability;
private ScreenCapability? _screenCapability;
private CameraCapability? _cameraCapability;
+ private VoiceCapability? _voiceCapability;
private readonly string _dataPath;
// Events
@@ -44,13 +46,14 @@ public class NodeService : IDisposable
public string? FullDeviceId => _nodeClient?.FullDeviceId;
public string? GatewayUrl => _nodeClient?.GatewayUrl;
- public NodeService(IOpenClawLogger logger, DispatcherQueue dispatcherQueue, string dataPath)
+ public NodeService(IOpenClawLogger logger, DispatcherQueue dispatcherQueue, VoiceService voiceService, string dataPath)
{
_logger = logger;
_dispatcherQueue = dispatcherQueue;
_dataPath = dataPath;
_screenCaptureService = new ScreenCaptureService(logger);
_cameraCaptureService = new CameraCaptureService(logger);
+ _voiceService = voiceService;
}
/// <summary>
@@ -79,6 +82,15 @@ public async Task ConnectAsync(string gatewayUrl, string token)
await _nodeClient.ConnectAsync();
_a2uiHostUrl = BuildA2UIHostUrl(_nodeClient.GatewayUrl);
+
+ if (_voiceService != null)
+ {
+ var settings = await _voiceService.GetSettingsAsync();
+ if (settings.Enabled && settings.Mode != VoiceActivationMode.Off)
+ {
+ await _voiceService.StartAsync(new VoiceStartArgs { Mode = settings.Mode });
+ }
+ }
}
/// <summary>
@@ -92,6 +104,11 @@ public async Task DisconnectAsync()
_nodeClient.Dispose();
_nodeClient = null;
}
+
+ if (_voiceService != null)
+ {
+ await _voiceService.StopAsync(new VoiceStopArgs { Reason = "Node disconnected" });
+ }
// Close canvas window
if (_canvasWindow != null && !_canvasWindow.IsClosed)
@@ -134,6 +151,16 @@ private void RegisterCapabilities()
_cameraCapability.ListRequested += OnCameraList;
_cameraCapability.SnapRequested += OnCameraSnap;
_nodeClient.RegisterCapability(_cameraCapability);
+
+ // Voice capability
+ _voiceCapability = new VoiceCapability(_logger);
+ _voiceCapability.ListDevicesRequested += OnVoiceListDevices;
+ _voiceCapability.SettingsRequested += OnVoiceGetSettings;
+ _voiceCapability.SettingsUpdateRequested += OnVoiceSetSettings;
+ _voiceCapability.StatusRequested += OnVoiceGetStatus;
+ _voiceCapability.StartRequested += OnVoiceStart;
+ _voiceCapability.StopRequested += OnVoiceStop;
+ _nodeClient.RegisterCapability(_voiceCapability);
_logger.Info("All capabilities registered");
}
@@ -474,6 +501,58 @@ private async Task<CameraSnapResult> OnCameraSnap(CameraSnapArgs args)
}
}
+ #endregion
+
+ #region Voice Capability Handlers
+
+ private Task<VoiceAudioDeviceInfo[]> OnVoiceListDevices()
+ {
+ if (_voiceService == null)
+ throw new InvalidOperationException("Voice service not available");
+
+ return _voiceService.ListDevicesAsync();
+ }
+
+ private Task<VoiceSettings> OnVoiceGetSettings()
+ {
+ if (_voiceService == null)
+ throw new InvalidOperationException("Voice service not available");
+
+ return _voiceService.GetSettingsAsync();
+ }
+
+ private Task<VoiceSettings> OnVoiceSetSettings(VoiceSettingsUpdateArgs args)
+ {
+ if (_voiceService == null)
+ throw new InvalidOperationException("Voice service not available");
+
+ return _voiceService.UpdateSettingsAsync(args);
+ }
+
+ private Task<VoiceStatusInfo> OnVoiceGetStatus()
+ {
+ if (_voiceService == null)
+ throw new InvalidOperationException("Voice service not available");
+
+ return _voiceService.GetStatusAsync();
+ }
+
+ private Task<VoiceStatusInfo> OnVoiceStart(VoiceStartArgs args)
+ {
+ if (_voiceService == null)
+ throw new InvalidOperationException("Voice service not available");
+
+ return _voiceService.StartAsync(args);
+ }
+
+ private Task<VoiceStatusInfo> OnVoiceStop(VoiceStopArgs args)
+ {
+ if (_voiceService == null)
+ throw new InvalidOperationException("Voice service not available");
+
+ return _voiceService.StopAsync(args);
+ }
+
#endregion
public void Dispose()
@@ -483,7 +562,6 @@ public void Dispose()
try { client?.Dispose(); } catch { /* ignore */ }
try { _cameraCaptureService?.Dispose(); } catch { /* ignore */ }
-
if (_canvasWindow != null && !_canvasWindow.IsClosed)
{
var window = _canvasWindow;
diff --git a/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs b/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
index 0c343f1..2fc93d7 100644
--- a/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
@@ -42,6 +42,7 @@ public class SettingsManager
public bool NotifyChatResponses { get; set; } = true;
public bool PreferStructuredCategories { get; set; } = true;
public List<OpenClaw.Shared.UserNotificationRule> UserRules { get; set; } = new();
+ public VoiceSettings Voice { get; set; } = new();
// Node mode (enables Windows as a node, not just operator)
public bool EnableNodeMode { get; set; } = false;
@@ -82,6 +83,7 @@ public void Load()
PreferStructuredCategories = loaded.PreferStructuredCategories;
if (loaded.UserRules != null)
UserRules = loaded.UserRules;
+ Voice = loaded.Voice ?? new VoiceSettings();
}
}
}
@@ -117,7 +119,8 @@ public void Save()
HasSeenActivityStreamTip = HasSeenActivityStreamTip,
NotifyChatResponses = NotifyChatResponses,
PreferStructuredCategories = PreferStructuredCategories,
- UserRules = UserRules
+ UserRules = UserRules,
+ Voice = Voice
};
var json = data.ToJson();
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceProviderCatalogService.cs
new file mode 100644
index 0000000..705336e
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceProviderCatalogService.cs
@@ -0,0 +1,155 @@
+using System;
+using System.Collections.Generic;
+using System.IO;
+using System.Linq;
+using System.Text.Json;
+using OpenClaw.Shared;
+
+namespace OpenClawTray.Services;
+
+public static class VoiceProviderCatalogService
+{
+ private static readonly string s_catalogFilePath = Path.Combine(
+ Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData),
+ "OpenClawTray",
+ "voice-providers.json");
+
+ private static readonly JsonSerializerOptions s_jsonOptions = new()
+ {
+ PropertyNameCaseInsensitive = true,
+ WriteIndented = true
+ };
+
+ public static string CatalogFilePath => s_catalogFilePath;
+
+ public static VoiceProviderCatalog LoadCatalog(IOpenClawLogger? logger = null)
+ {
+ var merged = CreateBuiltInCatalog();
+
+ try
+ {
+ if (!File.Exists(s_catalogFilePath))
+ {
+ return merged;
+ }
+
+ var json = File.ReadAllText(s_catalogFilePath);
+ var configured = JsonSerializer.Deserialize<VoiceProviderCatalog>(json, s_jsonOptions);
+ if (configured == null)
+ {
+ return merged;
+ }
+
+ merged.SpeechToTextProviders = MergeProviders(
+ merged.SpeechToTextProviders,
+ configured.SpeechToTextProviders);
+ merged.TextToSpeechProviders = MergeProviders(
+ merged.TextToSpeechProviders,
+ configured.TextToSpeechProviders);
+ }
+ catch (Exception ex)
+ {
+ logger?.Warn($"Failed to load voice provider catalog: {ex.Message}");
+ }
+
+ return merged;
+ }
+
+ public static VoiceProviderOption ResolveSpeechToTextProvider(string? providerId, IOpenClawLogger? logger = null)
+ {
+ var catalog = LoadCatalog(logger);
+ return ResolveProvider(catalog.SpeechToTextProviders, providerId);
+ }
+
+ public static VoiceProviderOption ResolveTextToSpeechProvider(string? providerId, IOpenClawLogger? logger = null)
+ {
+ var catalog = LoadCatalog(logger);
+ return ResolveProvider(catalog.TextToSpeechProviders, providerId);
+ }
+
+ public static bool SupportsWindowsRuntime(string? providerId)
+ {
+ return string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase);
+ }
+
+ private static VoiceProviderCatalog CreateBuiltInCatalog()
+ {
+ return new VoiceProviderCatalog
+ {
+ SpeechToTextProviders =
+ [
+ new VoiceProviderOption
+ {
+ Id = VoiceProviderIds.Windows,
+ Name = "Windows Speech Recognition",
+ Runtime = "windows",
+ Description = "Built-in Windows dictation and speech recognition."
+ }
+ ],
+ TextToSpeechProviders =
+ [
+ new VoiceProviderOption
+ {
+ Id = VoiceProviderIds.Windows,
+ Name = "Windows Speech Synthesis",
+ Runtime = "windows",
+ Description = "Built-in Windows text-to-speech playback."
+ }
+ ]
+ };
+ }
+
+ private static List<VoiceProviderOption> MergeProviders(
+ List<VoiceProviderOption> builtIn,
+ List<VoiceProviderOption> configured)
+ {
+ var merged = builtIn
+ .Select(Clone)
+ .ToDictionary(p => p.Id, StringComparer.OrdinalIgnoreCase);
+
+ foreach (var provider in configured.Where(p => !string.IsNullOrWhiteSpace(p.Id)))
+ {
+ merged[provider.Id] = Clone(provider);
+ }
+
+ return merged.Values
+ .Where(p => p.Enabled)
+ .OrderByDescending(p => string.Equals(p.Id, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase))
+ .ThenBy(p => p.Name, StringComparer.OrdinalIgnoreCase)
+ .ToList();
+ }
+
+ private static VoiceProviderOption ResolveProvider(IEnumerable<VoiceProviderOption> providers, string? providerId)
+ {
+ if (!string.IsNullOrWhiteSpace(providerId))
+ {
+ var configured = providers.FirstOrDefault(p => string.Equals(p.Id, providerId, StringComparison.OrdinalIgnoreCase));
+ if (configured != null)
+ {
+ return Clone(configured);
+ }
+ }
+
+ return providers
+ .Select(Clone)
+ .FirstOrDefault(p => string.Equals(p.Id, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase))
+ ?? new VoiceProviderOption
+ {
+ Id = VoiceProviderIds.Windows,
+ Name = "Windows Speech",
+ Runtime = "windows"
+ };
+ }
+
+ private static VoiceProviderOption Clone(VoiceProviderOption source)
+ {
+ return new VoiceProviderOption
+ {
+ Id = source.Id,
+ Name = source.Name,
+ Runtime = source.Runtime,
+ Enabled = source.Enabled,
+ Description = source.Description
+ };
+ }
+}
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
new file mode 100644
index 0000000..3d3e982
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
@@ -0,0 +1,1040 @@
+using System;
+using System.Collections.Generic;
+using System.Linq;
+using System.Threading;
+using System.Threading.Tasks;
+using OpenClaw.Shared;
+using OpenClawTray.Helpers;
+using Windows.Devices.Enumeration;
+using Windows.Foundation;
+using Windows.Media.Capture;
+using Windows.Media.Core;
+using Windows.Media.Devices;
+using Windows.Media.Playback;
+using Windows.Media.SpeechRecognition;
+using Windows.Media.SpeechSynthesis;
+
+namespace OpenClawTray.Services;
+
+public sealed class VoiceService : IDisposable
+{
+ private const int HResultSpeechPrivacyDeclined = unchecked((int)0x80045509);
+ private static readonly TimeSpan TransportConnectTimeout = TimeSpan.FromSeconds(10);
+ private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
+ private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromSeconds(2);
+
+ private readonly IOpenClawLogger _logger;
+ private readonly SettingsManager _settings;
+ private readonly object _gate = new();
+
+ private VoiceStatusInfo _status;
+ private VoiceActivationMode? _runtimeModeOverride;
+ private CancellationTokenSource? _runtimeCts;
+ private OpenClawGatewayClient? _chatClient;
+ private ConnectionStatus _chatTransportStatus = ConnectionStatus.Disconnected;
+ private TaskCompletionSource<bool>? _transportReadyTcs;
+ private SpeechRecognizer? _speechRecognizer;
+ private SpeechSynthesizer? _speechSynthesizer;
+ private MediaPlayer? _mediaPlayer;
+ private bool _recognitionActive;
+ private bool _awaitingReply;
+ private bool _isSpeaking;
+ private string? _lastTranscript;
+ private DateTime _lastTranscriptUtc;
+ private bool _disposed;
+
+ public VoiceService(IOpenClawLogger logger, SettingsManager settings)
+ {
+ _logger = logger;
+ _settings = settings;
+ _status = new VoiceStatusInfo();
+ _status = BuildStoppedStatus(null, null);
+ }
+
+ public VoiceStatusInfo CurrentStatus
+ {
+ get
+ {
+ lock (_gate)
+ {
+ return Clone(_status);
+ }
+ }
+ }
+
+ public Task<VoiceSettings> GetSettingsAsync()
+ {
+ lock (_gate)
+ {
+ return Task.FromResult(Clone(_settings.Voice));
+ }
+ }
+
+ public Task<VoiceSettings> UpdateSettingsAsync(VoiceSettingsUpdateArgs update)
+ {
+ ArgumentNullException.ThrowIfNull(update);
+
+ lock (_gate)
+ {
+ _settings.Voice = Clone(update.Settings);
+ if (update.Persist)
+ {
+ _settings.Save();
+ }
+
+ if (_status.Running)
+ {
+ _status = BuildRunningStatus(
+ _runtimeModeOverride ?? _settings.Voice.Mode,
+ _status.SessionKey,
+ _status.State,
+ _status.LastError);
+ _status.LastUtteranceUtc = _status.LastUtteranceUtc;
+ _status.LastWakeWordUtc = _status.LastWakeWordUtc;
+ }
+ else
+ {
+ _status = BuildStoppedStatus(_status.SessionKey, _status.LastError);
+ }
+
+ return Task.FromResult(Clone(_settings.Voice));
+ }
+ }
+
+ public Task<VoiceStatusInfo> GetStatusAsync()
+ {
+ lock (_gate)
+ {
+ return Task.FromResult(Clone(_status));
+ }
+ }
+
+ public async Task<VoiceStatusInfo> StartAsync(VoiceStartArgs args)
+ {
+ ObjectDisposedException.ThrowIf(_disposed, this);
+
+ args ??= new VoiceStartArgs();
+
+ VoiceSettings effectiveSettings;
+ VoiceActivationMode requestedMode;
+ string? sessionKey;
+
+ lock (_gate)
+ {
+ effectiveSettings = Clone(_settings.Voice);
+ requestedMode = args.Mode ?? effectiveSettings.Mode;
+ sessionKey = args.SessionKey ?? _status.SessionKey;
+
+ if (args.Mode.HasValue && args.Mode.Value != VoiceActivationMode.Off)
+ {
+ effectiveSettings.Enabled = true;
+ effectiveSettings.Mode = args.Mode.Value;
+ _runtimeModeOverride = args.Mode.Value;
+ }
+ else if (args.Mode == VoiceActivationMode.Off)
+ {
+ _runtimeModeOverride = null;
+ }
+
+ if (!effectiveSettings.Enabled || requestedMode == VoiceActivationMode.Off)
+ {
+ _status = BuildStoppedStatus(sessionKey, "Voice mode is disabled");
+ return Clone(_status);
+ }
+ }
+
+ await StopRuntimeResourcesAsync(updateStoppedStatus: false);
+
+ try
+ {
+ switch (requestedMode)
+ {
+ case VoiceActivationMode.AlwaysOn:
+ await StartAlwaysOnRuntimeAsync(effectiveSettings, sessionKey);
+ break;
+ case VoiceActivationMode.WakeWord:
+ lock (_gate)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.WakeWord,
+ sessionKey,
+ VoiceRuntimeState.ListeningForWakeWord,
+ "WakeWord capture is not implemented yet");
+ }
+ _logger.Info("Voice runtime started in mode WakeWord");
+ break;
+ default:
+ lock (_gate)
+ {
+ _status = BuildStoppedStatus(sessionKey, "Voice mode is disabled");
+ }
+ break;
+ }
+ }
+ catch (Exception ex)
+ {
+ _logger.Error("Voice runtime start failed", ex);
+ lock (_gate)
+ {
+ _status = BuildErrorStatus(requestedMode, sessionKey, GetUserFacingErrorMessage(ex));
+ }
+ }
+
+ return CurrentStatus;
+ }
+
+ public async Task<VoiceStatusInfo> StopAsync(VoiceStopArgs args)
+ {
+ args ??= new VoiceStopArgs();
+
+ await StopRuntimeResourcesAsync(updateStoppedStatus: false);
+
+ lock (_gate)
+ {
+ _runtimeModeOverride = null;
+ _status = BuildStoppedStatus(_status.SessionKey, args.Reason);
+ _logger.Info($"Voice runtime stopped{(string.IsNullOrWhiteSpace(args.Reason) ? string.Empty : $": {args.Reason}")}");
+ return Clone(_status);
+ }
+ }
+
+ public async Task<VoiceAudioDeviceInfo[]> ListDevicesAsync()
+ {
+ try
+ {
+ var inputDefaultId = MediaDevice.GetDefaultAudioCaptureId(AudioDeviceRole.Default);
+ var outputDefaultId = MediaDevice.GetDefaultAudioRenderId(AudioDeviceRole.Default);
+ var results = new List<VoiceAudioDeviceInfo>();
+
+ var inputDevices = await DeviceInformation.FindAllAsync(DeviceClass.AudioCapture);
+ foreach (var device in inputDevices)
+ {
+ results.Add(new VoiceAudioDeviceInfo
+ {
+ DeviceId = device.Id,
+ Name = device.Name,
+ IsDefault = string.Equals(device.Id, inputDefaultId, StringComparison.Ordinal),
+ IsInput = true
+ });
+ }
+
+ var outputDevices = await DeviceInformation.FindAllAsync(DeviceClass.AudioRender);
+ foreach (var device in outputDevices)
+ {
+ results.Add(new VoiceAudioDeviceInfo
+ {
+ DeviceId = device.Id,
+ Name = device.Name,
+ IsDefault = string.Equals(device.Id, outputDefaultId, StringComparison.Ordinal),
+ IsOutput = true
+ });
+ }
+
+ return results
+ .OrderByDescending(d => d.IsDefault)
+ .ThenBy(d => d.IsInput ? 0 : 1)
+ .ThenBy(d => d.Name, StringComparer.OrdinalIgnoreCase)
+ .ToArray();
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice device enumeration failed: {ex.Message}");
+ return
+ [
+ new VoiceAudioDeviceInfo
+ {
+ DeviceId = "default-input",
+ Name = "System default microphone",
+ IsDefault = true,
+ IsInput = true
+ },
+ new VoiceAudioDeviceInfo
+ {
+ DeviceId = "default-output",
+ Name = "System default speaker",
+ IsDefault = true,
+ IsOutput = true
+ }
+ ];
+ }
+ }
+
+ public VoiceProviderCatalog GetProviderCatalog()
+ {
+ return VoiceProviderCatalogService.LoadCatalog(_logger);
+ }
+
+ public void Dispose()
+ {
+ if (_disposed)
+ {
+ return;
+ }
+
+ _disposed = true;
+ _ = StopRuntimeResourcesAsync(updateStoppedStatus: true);
+ }
+
+ private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? sessionKey)
+ {
+ var selectedSpeechToText = VoiceProviderCatalogService.ResolveSpeechToTextProvider(
+ settings.SpeechToTextProviderId,
+ _logger);
+ var selectedTextToSpeech = VoiceProviderCatalogService.ResolveTextToSpeechProvider(
+ settings.TextToSpeechProviderId,
+ _logger);
+ var fallbackMessage = BuildProviderFallbackMessage(selectedSpeechToText, selectedTextToSpeech);
+
+ await EnsureMicrophoneConsentAsync();
+
+ var runtimeCts = new CancellationTokenSource();
+ var recognizer = await CreateSpeechRecognizerAsync(settings);
+ var synthesizer = new SpeechSynthesizer();
+ var player = new MediaPlayer();
+
+ if (!string.IsNullOrWhiteSpace(settings.InputDeviceId))
+ {
+ _logger.Warn("Selected input device is saved, but AlwaysOn currently uses the system speech input device.");
+ }
+
+ if (!string.IsNullOrWhiteSpace(settings.OutputDeviceId))
+ {
+ _logger.Warn("Selected output device is saved, but AlwaysOn currently uses the default speech output device.");
+ }
+
+ recognizer.ContinuousRecognitionSession.ResultGenerated += OnSpeechResultGenerated;
+ recognizer.ContinuousRecognitionSession.Completed += OnSpeechRecognitionCompleted;
+
+ lock (_gate)
+ {
+ _runtimeCts = runtimeCts;
+ _speechRecognizer = recognizer;
+ _speechSynthesizer = synthesizer;
+ _mediaPlayer = player;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ sessionKey,
+ VoiceRuntimeState.Arming,
+ fallbackMessage);
+ }
+
+ await EnsureChatTransportAsync(runtimeCts.Token);
+ await StartRecognitionSessionAsync();
+
+ lock (_gate)
+ {
+ if (_status.Running)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ sessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ fallbackMessage);
+ }
+ }
+
+ _logger.Info("Voice runtime started in mode AlwaysOn");
+ }
+
+ private async Task<SpeechRecognizer> CreateSpeechRecognizerAsync(VoiceSettings settings)
+ {
+ var recognizer = new SpeechRecognizer();
+ recognizer.Timeouts.EndSilenceTimeout = TimeSpan.FromMilliseconds(settings.AlwaysOn.EndSilenceMs);
+ recognizer.Timeouts.InitialSilenceTimeout = TimeSpan.FromSeconds(10);
+ recognizer.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(4);
+ recognizer.Constraints.Add(new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.Dictation, "always-on-dictation"));
+
+ var compilation = await recognizer.CompileConstraintsAsync();
+ if (compilation.Status != SpeechRecognitionResultStatus.Success)
+ {
+ recognizer.Dispose();
+ throw new InvalidOperationException($"Speech recognizer unavailable: {compilation.Status}");
+ }
+
+ return recognizer;
+ }
+
+ private async Task EnsureMicrophoneConsentAsync()
+ {
+ if (!PackageHelper.IsPackaged)
+ {
+ return;
+ }
+
+ using var capture = new MediaCapture();
+ var initSettings = new MediaCaptureInitializationSettings
+ {
+ StreamingCaptureMode = StreamingCaptureMode.Audio,
+ SharingMode = MediaCaptureSharingMode.SharedReadOnly,
+ MemoryPreference = MediaCaptureMemoryPreference.Cpu
+ };
+
+ await capture.InitializeAsync(initSettings);
+ }
+
+ private async Task EnsureChatTransportAsync(CancellationToken cancellationToken)
+ {
+ OpenClawGatewayClient? existingClient;
+ ConnectionStatus existingStatus;
+
+ lock (_gate)
+ {
+ existingClient = _chatClient;
+ existingStatus = _chatTransportStatus;
+ if (existingStatus == ConnectionStatus.Connected)
+ {
+ return;
+ }
+
+ _transportReadyTcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+
+ if (existingClient == null)
+ {
+ _chatClient = new OpenClawGatewayClient(_settings.GatewayUrl, _settings.Token, _logger);
+ _chatClient.StatusChanged += OnChatTransportStatusChanged;
+ _chatClient.NotificationReceived += OnChatNotificationReceived;
+ existingClient = _chatClient;
+ _chatTransportStatus = ConnectionStatus.Connecting;
+ }
+ }
+
+ if (existingStatus == ConnectionStatus.Disconnected || existingClient != _chatClient)
+ {
+ await existingClient!.ConnectAsync();
+ }
+
+ Task readyTask;
+ lock (_gate)
+ {
+ readyTask = _transportReadyTcs?.Task ?? Task.CompletedTask;
+ }
+
+ using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
+ timeoutCts.CancelAfter(TransportConnectTimeout);
+
+ var completed = await Task.WhenAny(readyTask, Task.Delay(Timeout.InfiniteTimeSpan, timeoutCts.Token));
+ if (completed != readyTask)
+ {
+ throw new TimeoutException("Timed out connecting voice chat transport.");
+ }
+
+ await readyTask;
+ }
+
+ private async Task StartRecognitionSessionAsync()
+ {
+ SpeechRecognizer? recognizer;
+
+ lock (_gate)
+ {
+ recognizer = _speechRecognizer;
+ if (recognizer == null || _recognitionActive)
+ {
+ return;
+ }
+ }
+
+ await recognizer.ContinuousRecognitionSession.StartAsync();
+
+ lock (_gate)
+ {
+ _recognitionActive = true;
+ if (_status.Running && !_awaitingReply && !_isSpeaking)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ null);
+ _status.LastUtteranceUtc = _status.LastUtteranceUtc;
+ }
+ }
+ }
+
+ private async Task StopRecognitionSessionAsync()
+ {
+ SpeechRecognizer? recognizer;
+
+ lock (_gate)
+ {
+ recognizer = _speechRecognizer;
+ if (recognizer == null || !_recognitionActive)
+ {
+ return;
+ }
+
+ _recognitionActive = false;
+ }
+
+ try
+ {
+ await recognizer.ContinuousRecognitionSession.CancelAsync();
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice recognition stop failed: {ex.Message}");
+ }
+ }
+
+ private async void OnSpeechResultGenerated(
+ SpeechContinuousRecognitionSession sender,
+ SpeechContinuousRecognitionResultGeneratedEventArgs args)
+ {
+ try
+ {
+ var result = args.Result;
+ var text = result.Text?.Trim();
+ if (string.IsNullOrWhiteSpace(text))
+ {
+ return;
+ }
+
+ if (result.Status != SpeechRecognitionResultStatus.Success ||
+ result.Confidence == SpeechRecognitionConfidence.Rejected)
+ {
+ return;
+ }
+
+ await HandleRecognizedTextAsync(text);
+ }
+ catch (Exception ex)
+ {
+ _logger.Error("Voice recognition handler failed", ex);
+ lock (_gate)
+ {
+ if (_status.Running)
+ {
+ _status = BuildErrorStatus(VoiceActivationMode.AlwaysOn, _status.SessionKey, GetUserFacingErrorMessage(ex));
+ }
+ }
+ }
+ }
+
+ private async Task HandleRecognizedTextAsync(string text)
+ {
+ CancellationToken cancellationToken;
+
+ lock (_gate)
+ {
+ if (_runtimeCts == null || _status.Mode != VoiceActivationMode.AlwaysOn || !_status.Running)
+ {
+ return;
+ }
+
+ if (_awaitingReply || _isSpeaking)
+ {
+ return;
+ }
+
+ if (string.Equals(text, _lastTranscript, StringComparison.OrdinalIgnoreCase) &&
+ DateTime.UtcNow - _lastTranscriptUtc < DuplicateTranscriptWindow)
+ {
+ return;
+ }
+
+ _lastTranscript = text;
+ _lastTranscriptUtc = DateTime.UtcNow;
+ _awaitingReply = true;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.AwaitingResponse,
+ _status.LastError);
+ _status.LastUtteranceUtc = DateTime.UtcNow;
+ cancellationToken = _runtimeCts.Token;
+ }
+
+ await StopRecognitionSessionAsync();
+
+ try
+ {
+ await EnsureChatTransportAsync(cancellationToken);
+
+ OpenClawGatewayClient? client;
+ lock (_gate)
+ {
+ client = _chatClient;
+ }
+
+ if (client == null)
+ {
+ throw new InvalidOperationException("Voice chat transport is unavailable.");
+ }
+
+ _logger.Info($"Voice transcript captured: {text}");
+ await client.SendChatMessageAsync(text);
+ _ = MonitorReplyTimeoutAsync(text, cancellationToken);
+ }
+ catch (Exception ex)
+ {
+ _logger.Error("Voice transcript submit failed", ex);
+
+ lock (_gate)
+ {
+ _awaitingReply = false;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ GetUserFacingErrorMessage(ex));
+ }
+
+ await StartRecognitionSessionAsync();
+ }
+ }
+
+ private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken cancellationToken)
+ {
+ try
+ {
+ await Task.Delay(ReplyTimeout, cancellationToken);
+
+ var shouldResume = false;
+ lock (_gate)
+ {
+ if (_awaitingReply &&
+ string.Equals(_lastTranscript, transcript, StringComparison.OrdinalIgnoreCase))
+ {
+ _awaitingReply = false;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ "Timed out waiting for an assistant reply.");
+ _status.LastUtteranceUtc = _status.LastUtteranceUtc;
+ shouldResume = true;
+ }
+ }
+
+ if (shouldResume)
+ {
+ await StartRecognitionSessionAsync();
+ }
+ }
+ catch (OperationCanceledException)
+ {
+ }
+ }
+
+ private async void OnChatNotificationReceived(object? sender, OpenClawNotification notification)
+ {
+ if (!notification.IsChat || string.IsNullOrWhiteSpace(notification.Message))
+ {
+ return;
+ }
+
+ string text;
+
+ lock (_gate)
+ {
+ if (!_awaitingReply || !_status.Running || _status.Mode != VoiceActivationMode.AlwaysOn)
+ {
+ return;
+ }
+
+ _awaitingReply = false;
+ _isSpeaking = true;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.PlayingResponse,
+ _status.LastError);
+ text = notification.Message;
+ }
+
+ try
+ {
+ await SpeakTextAsync(text);
+ }
+ catch (Exception ex)
+ {
+ _logger.Error("Voice reply playback failed", ex);
+ lock (_gate)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ GetUserFacingErrorMessage(ex));
+ }
+ }
+ finally
+ {
+ lock (_gate)
+ {
+ _isSpeaking = false;
+ if (_status.Running)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ _status.LastError);
+ _status.LastUtteranceUtc = _status.LastUtteranceUtc;
+ }
+ }
+
+ try
+ {
+ await StartRecognitionSessionAsync();
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice recognition resume failed: {ex.Message}");
+ }
+ }
+ }
+
+ private async Task SpeakTextAsync(string text)
+ {
+ SpeechSynthesizer? synthesizer;
+ MediaPlayer? player;
+
+ lock (_gate)
+ {
+ synthesizer = _speechSynthesizer;
+ player = _mediaPlayer;
+ }
+
+ if (synthesizer == null || player == null)
+ {
+ throw new InvalidOperationException("Speech playback is not ready.");
+ }
+
+ using var stream = await synthesizer.SynthesizeTextToStreamAsync(text);
+ var playbackEnded = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+
+ TypedEventHandler<MediaPlayer, object>? endedHandler = null;
+ TypedEventHandler<MediaPlayer, MediaPlayerFailedEventArgs>? failedHandler = null;
+
+ endedHandler = (sender, _) => playbackEnded.TrySetResult(true);
+ failedHandler = (sender, args) => playbackEnded.TrySetException(new InvalidOperationException(args.ErrorMessage));
+
+ player.MediaEnded += endedHandler;
+ player.MediaFailed += failedHandler;
+
+ try
+ {
+ player.Source = MediaSource.CreateFromStream(stream, stream.ContentType);
+ player.Play();
+ await playbackEnded.Task;
+ }
+ finally
+ {
+ player.MediaEnded -= endedHandler;
+ player.MediaFailed -= failedHandler;
+ player.Source = null;
+ }
+ }
+
+ private async void OnSpeechRecognitionCompleted(
+ SpeechContinuousRecognitionSession sender,
+ SpeechContinuousRecognitionCompletedEventArgs args)
+ {
+ try
+ {
+ CancellationToken token;
+ var shouldRestart = false;
+
+ lock (_gate)
+ {
+ if (_runtimeCts == null || _runtimeCts.IsCancellationRequested)
+ {
+ return;
+ }
+
+ _recognitionActive = false;
+ token = _runtimeCts.Token;
+ shouldRestart = _status.Running &&
+ _status.Mode == VoiceActivationMode.AlwaysOn &&
+ !_awaitingReply &&
+ !_isSpeaking;
+ }
+
+ if (shouldRestart && !token.IsCancellationRequested)
+ {
+ await Task.Delay(250, token);
+ await StartRecognitionSessionAsync();
+ }
+ }
+ catch (OperationCanceledException)
+ {
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice recognition completion handler failed: {ex.Message}");
+ }
+ }
+
+ private void OnChatTransportStatusChanged(object? sender, ConnectionStatus status)
+ {
+ lock (_gate)
+ {
+ _chatTransportStatus = status;
+
+ if (status == ConnectionStatus.Connected)
+ {
+ _transportReadyTcs?.TrySetResult(true);
+
+ if (_status.Running &&
+ _status.Mode == VoiceActivationMode.AlwaysOn &&
+ !_awaitingReply &&
+ !_isSpeaking)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ _status.LastError);
+ _status.LastUtteranceUtc = _status.LastUtteranceUtc;
+ }
+ }
+ else if (status == ConnectionStatus.Error)
+ {
+ _transportReadyTcs?.TrySetException(
+ new InvalidOperationException("Voice chat transport failed to connect."));
+
+ if (_status.Running && _status.Mode == VoiceActivationMode.AlwaysOn)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.Arming,
+ "Voice chat transport failed.");
+ _status.LastUtteranceUtc = _status.LastUtteranceUtc;
+ }
+ }
+ else if (status == ConnectionStatus.Disconnected)
+ {
+ if (_status.Running && _status.Mode == VoiceActivationMode.AlwaysOn)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.Arming,
+ "Voice chat transport disconnected.");
+ _status.LastUtteranceUtc = _status.LastUtteranceUtc;
+ }
+ }
+ }
+ }
+
+ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
+ {
+ CancellationTokenSource? runtimeCts;
+ OpenClawGatewayClient? chatClient;
+ SpeechRecognizer? recognizer;
+ SpeechSynthesizer? synthesizer;
+ MediaPlayer? player;
+ var sessionKey = CurrentStatus.SessionKey;
+
+ lock (_gate)
+ {
+ runtimeCts = _runtimeCts;
+ _runtimeCts = null;
+
+ chatClient = _chatClient;
+ _chatClient = null;
+ _chatTransportStatus = ConnectionStatus.Disconnected;
+ _transportReadyTcs = null;
+
+ recognizer = _speechRecognizer;
+ _speechRecognizer = null;
+ _recognitionActive = false;
+
+ synthesizer = _speechSynthesizer;
+ _speechSynthesizer = null;
+
+ player = _mediaPlayer;
+ _mediaPlayer = null;
+
+ _awaitingReply = false;
+ _isSpeaking = false;
+ }
+
+ try { runtimeCts?.Cancel(); } catch { }
+
+ if (recognizer != null)
+ {
+ recognizer.ContinuousRecognitionSession.ResultGenerated -= OnSpeechResultGenerated;
+ recognizer.ContinuousRecognitionSession.Completed -= OnSpeechRecognitionCompleted;
+
+ try { await recognizer.ContinuousRecognitionSession.CancelAsync(); } catch { }
+ try { recognizer.Dispose(); } catch { }
+ }
+
+ if (player != null)
+ {
+ try { player.Pause(); } catch { }
+ try { player.Source = null; } catch { }
+ try { player.Dispose(); } catch { }
+ }
+
+ try { synthesizer?.Dispose(); } catch { }
+
+ if (chatClient != null)
+ {
+ chatClient.StatusChanged -= OnChatTransportStatusChanged;
+ chatClient.NotificationReceived -= OnChatNotificationReceived;
+ try { await chatClient.DisconnectAsync(); } catch { }
+ try { chatClient.Dispose(); } catch { }
+ }
+
+ try { runtimeCts?.Dispose(); } catch { }
+
+ if (updateStoppedStatus)
+ {
+ lock (_gate)
+ {
+ _status = BuildStoppedStatus(sessionKey, "Disposed");
+ }
+ }
+ }
+
+ private VoiceStatusInfo BuildRunningStatus(
+ VoiceActivationMode mode,
+ string? sessionKey,
+ VoiceRuntimeState state,
+ string? lastError)
+ {
+ var settings = _settings.Voice;
+ return new VoiceStatusInfo
+ {
+ Available = true,
+ Running = true,
+ Mode = mode,
+ State = state,
+ SessionKey = sessionKey,
+ InputDeviceId = settings.InputDeviceId,
+ OutputDeviceId = settings.OutputDeviceId,
+ WakeWordModelId = settings.WakeWord.ModelId,
+ WakeWordLoaded = mode == VoiceActivationMode.WakeWord,
+ LastWakeWordUtc = _status.LastWakeWordUtc,
+ LastUtteranceUtc = _status.LastUtteranceUtc,
+ LastError = lastError
+ };
+ }
+
+ private VoiceStatusInfo BuildStoppedStatus(string? sessionKey, string? reason)
+ {
+ var settings = _settings.Voice;
+ return new VoiceStatusInfo
+ {
+ Available = true,
+ Running = false,
+ Mode = _runtimeModeOverride ?? settings.Mode,
+ State = VoiceRuntimeState.Stopped,
+ SessionKey = sessionKey,
+ InputDeviceId = settings.InputDeviceId,
+ OutputDeviceId = settings.OutputDeviceId,
+ WakeWordModelId = settings.WakeWord.ModelId,
+ WakeWordLoaded = false,
+ LastWakeWordUtc = _status.LastWakeWordUtc,
+ LastUtteranceUtc = _status.LastUtteranceUtc,
+ LastError = reason
+ };
+ }
+
+ private VoiceStatusInfo BuildErrorStatus(VoiceActivationMode mode, string? sessionKey, string? reason)
+ {
+ var status = BuildRunningStatus(mode, sessionKey, VoiceRuntimeState.Error, reason);
+ status.Running = false;
+ return status;
+ }
+
+ private static VoiceSettings Clone(VoiceSettings source)
+ {
+ return new VoiceSettings
+ {
+ Mode = source.Mode,
+ Enabled = source.Enabled,
+ SpeechToTextProviderId = source.SpeechToTextProviderId,
+ TextToSpeechProviderId = source.TextToSpeechProviderId,
+ InputDeviceId = source.InputDeviceId,
+ OutputDeviceId = source.OutputDeviceId,
+ SampleRateHz = source.SampleRateHz,
+ CaptureChunkMs = source.CaptureChunkMs,
+ BargeInEnabled = source.BargeInEnabled,
+ WakeWord = new VoiceWakeWordSettings
+ {
+ Engine = source.WakeWord.Engine,
+ ModelId = source.WakeWord.ModelId,
+ TriggerThreshold = source.WakeWord.TriggerThreshold,
+ TriggerCooldownMs = source.WakeWord.TriggerCooldownMs,
+ PreRollMs = source.WakeWord.PreRollMs,
+ EndSilenceMs = source.WakeWord.EndSilenceMs
+ },
+ AlwaysOn = new VoiceAlwaysOnSettings
+ {
+ MinSpeechMs = source.AlwaysOn.MinSpeechMs,
+ EndSilenceMs = source.AlwaysOn.EndSilenceMs,
+ MaxUtteranceMs = source.AlwaysOn.MaxUtteranceMs,
+ AutoSubmit = source.AlwaysOn.AutoSubmit
+ }
+ };
+ }
+
+ private static VoiceStatusInfo Clone(VoiceStatusInfo source)
+ {
+ return new VoiceStatusInfo
+ {
+ Available = source.Available,
+ Running = source.Running,
+ Mode = source.Mode,
+ State = source.State,
+ SessionKey = source.SessionKey,
+ InputDeviceId = source.InputDeviceId,
+ OutputDeviceId = source.OutputDeviceId,
+ WakeWordModelId = source.WakeWordModelId,
+ WakeWordLoaded = source.WakeWordLoaded,
+ LastWakeWordUtc = source.LastWakeWordUtc,
+ LastUtteranceUtc = source.LastUtteranceUtc,
+ LastError = source.LastError
+ };
+ }
+
+ private static string? BuildProviderFallbackMessage(
+ VoiceProviderOption speechToTextProvider,
+ VoiceProviderOption textToSpeechProvider)
+ {
+ var fallbacks = new List<string>();
+
+ if (!VoiceProviderCatalogService.SupportsWindowsRuntime(speechToTextProvider.Id))
+ {
+ fallbacks.Add($"STT '{speechToTextProvider.Name}' is not implemented yet; using Windows Speech Recognition.");
+ }
+
+ if (!VoiceProviderCatalogService.SupportsWindowsRuntime(textToSpeechProvider.Id))
+ {
+ fallbacks.Add($"TTS '{textToSpeechProvider.Name}' is not implemented yet; using Windows Speech Synthesis.");
+ }
+
+ return fallbacks.Count == 0 ? null : string.Join(" ", fallbacks);
+ }
+
+ private static string GetUserFacingErrorMessage(Exception ex)
+ {
+ if (IsSpeechPrivacyDeclined(ex))
+ {
+ return "Windows online speech recognition is disabled. Open Settings > Privacy & security > Speech and turn on Online speech recognition, then restart Voice Mode.";
+ }
+
+ if (ex is UnauthorizedAccessException)
+ {
+ return "Microphone access is blocked. Open Settings > Privacy & security > Microphone and allow desktop apps to use the microphone.";
+ }
+
+ return ex.Message;
+ }
+
+ private static bool IsSpeechPrivacyDeclined(Exception ex)
+ {
+ if (ex.HResult == HResultSpeechPrivacyDeclined)
+ {
+ return true;
+ }
+
+ return ex.Message.Contains("speech privacy policy", StringComparison.OrdinalIgnoreCase) ||
+ ex.Message.Contains("online speech recognition", StringComparison.OrdinalIgnoreCase);
+ }
+}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
new file mode 100644
index 0000000..57cb962
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
@@ -0,0 +1,92 @@
+<?xml version="1.0" encoding="utf-8"?>
+<winex:WindowEx
+ x:Class="OpenClawTray.Windows.VoiceModeWindow"
+ xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
+ xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
+ xmlns:winex="using:WinUIEx"
+ Title="Voice Mode"
+ MinWidth="420"
+ MinHeight="480">
+
+ <Window.SystemBackdrop>
+ <MicaBackdrop/>
+ </Window.SystemBackdrop>
+
+ <Grid>
+ <Grid.RowDefinitions>
+ <RowDefinition Height="*"/>
+ <RowDefinition Height="Auto"/>
+ </Grid.RowDefinitions>
+
+ <ScrollViewer Grid.Row="0" VerticalScrollBarVisibility="Auto" Padding="24,24,24,12">
+ <StackPanel Spacing="24" MaxWidth="500">
+ <StackPanel Spacing="8">
+ <TextBlock Text="VOICE MODE" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
+ <TextBlock Text="Windows voice mode targets macOS parity: AlwaysOn maps to Talk Mode, and WakeWord maps to Voice Wake." TextWrapping="Wrap"/>
+ </StackPanel>
+
+ <StackPanel Spacing="8">
+ <ComboBox x:Name="ModeComboBox" Header="Mode" SelectionChanged="OnModeChanged">
+ <ComboBoxItem Content="Off" Tag="Off"/>
+ <ComboBoxItem Content="WakeWord" Tag="WakeWord"/>
+ <ComboBoxItem Content="AlwaysOn" Tag="AlwaysOn"/>
+ </ComboBox>
+
+ <InfoBar x:Name="ModeInfoBar" IsOpen="True" Severity="Informational" IsClosable="False"
+ Title="Implementation status"
+ Message="AlwaysOn is the current runtime target. WakeWord settings can be configured now, and NanoWakeWord activation will follow in a later step."/>
+ </StackPanel>
+
+ <StackPanel Spacing="8">
+ <TextBlock Text="PROVIDERS" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
+ <ComboBox x:Name="SpeechToTextProviderComboBox" Header="Speech to text provider" DisplayMemberPath="Name" SelectionChanged="OnProviderChanged"/>
+ <ComboBox x:Name="TextToSpeechProviderComboBox" Header="Text to speech provider" DisplayMemberPath="Name" SelectionChanged="OnProviderChanged"/>
+ <TextBlock x:Name="ProviderInfoTextBlock"
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ TextWrapping="Wrap"/>
+ </StackPanel>
+
+ <StackPanel Spacing="8">
+ <TextBlock Text="DEVICES" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
+ <ComboBox x:Name="InputDeviceComboBox" Header="Listen device (microphone)" DisplayMemberPath="Name"/>
+ <ComboBox x:Name="OutputDeviceComboBox" Header="Talk device (speaker)" DisplayMemberPath="Name"/>
+ <Button Content="Refresh devices" HorizontalAlignment="Left" Click="OnRefreshDevices"/>
+ </StackPanel>
+
+ <TextBlock x:Name="StatusTextBlock"
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ TextWrapping="Wrap"/>
+
+ <StackPanel x:Name="TroubleshootingPanel" Spacing="8" Visibility="Collapsed">
+ <TextBlock x:Name="TroubleshootingTextBlock"
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ TextWrapping="Wrap"/>
+ <StackPanel Orientation="Horizontal" Spacing="8">
+ <Button x:Name="OpenSpeechSettingsButton"
+ Content="Open Speech Settings"
+ Click="OnOpenSpeechSettings"
+ Visibility="Collapsed"/>
+ <Button x:Name="OpenMicrophoneSettingsButton"
+ Content="Open Microphone Settings"
+ Click="OnOpenMicrophoneSettings"
+ Visibility="Collapsed"/>
+ </StackPanel>
+ </StackPanel>
+ </StackPanel>
+ </ScrollViewer>
+
+ <Border Grid.Row="1"
+ Background="{ThemeResource CardBackgroundFillColorDefaultBrush}"
+ BorderBrush="{ThemeResource CardStrokeColorDefaultBrush}"
+ BorderThickness="0,1,0,0"
+ Padding="24,16">
+ <StackPanel Orientation="Horizontal" Spacing="8" HorizontalAlignment="Right">
+ <Button Content="Cancel" Click="OnCancel" Width="80"/>
+ <Button Content="Save" Click="OnSave" Width="80" Style="{ThemeResource AccentButtonStyle}"/>
+ </StackPanel>
+ </Border>
+ </Grid>
+</winex:WindowEx>
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
new file mode 100644
index 0000000..8137732
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
@@ -0,0 +1,330 @@
+using Microsoft.UI.Xaml;
+using Microsoft.UI.Xaml.Controls;
+using OpenClaw.Shared;
+using OpenClawTray.Helpers;
+using OpenClawTray.Services;
+using System;
+using System.Collections.Generic;
+using System.Diagnostics;
+using System.Linq;
+using System.Threading.Tasks;
+using WinUIEx;
+
+namespace OpenClawTray.Windows;
+
+public sealed partial class VoiceModeWindow : WindowEx
+{
+ private readonly SettingsManager _settings;
+ private readonly VoiceService _voiceService;
+ private List<ProviderOption> _speechToTextOptions = new();
+ private List<ProviderOption> _textToSpeechOptions = new();
+ private List<DeviceOption> _inputOptions = new();
+ private List<DeviceOption> _outputOptions = new();
+
+ public bool IsClosed { get; private set; }
+
+ public VoiceModeWindow(SettingsManager settings, VoiceService voiceService)
+ {
+ _settings = settings;
+ _voiceService = voiceService;
+
+ InitializeComponent();
+
+ Title = "Voice Mode";
+ this.SetWindowSize(520, 620);
+ this.CenterOnScreen();
+ this.SetIcon(IconHelper.GetStatusIconPath(ConnectionStatus.Connected));
+
+ Closed += (s, e) => IsClosed = true;
+
+ LoadSettings();
+ _ = LoadDevicesAsync();
+ }
+
+ private void LoadSettings()
+ {
+ LoadProviders();
+ SelectMode(_settings.Voice.Mode);
+ UpdateModeInfo();
+ UpdateProviderInfo();
+ StatusTextBlock.Text = BuildStatusText();
+ }
+
+ private void LoadProviders()
+ {
+ var catalog = _voiceService.GetProviderCatalog();
+
+ _speechToTextOptions = catalog.SpeechToTextProviders
+ .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
+ .ToList();
+ _textToSpeechOptions = catalog.TextToSpeechProviders
+ .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
+ .ToList();
+
+ SpeechToTextProviderComboBox.ItemsSource = _speechToTextOptions;
+ TextToSpeechProviderComboBox.ItemsSource = _textToSpeechOptions;
+
+ SpeechToTextProviderComboBox.SelectedItem =
+ _speechToTextOptions.FirstOrDefault(p => p.Id == _settings.Voice.SpeechToTextProviderId)
+ ?? _speechToTextOptions.FirstOrDefault();
+ TextToSpeechProviderComboBox.SelectedItem =
+ _textToSpeechOptions.FirstOrDefault(p => p.Id == _settings.Voice.TextToSpeechProviderId)
+ ?? _textToSpeechOptions.FirstOrDefault();
+ }
+
+ private async Task LoadDevicesAsync()
+ {
+ try
+ {
+ StatusTextBlock.Text = "Loading audio devices...";
+ var devices = await _voiceService.ListDevicesAsync();
+
+ _inputOptions =
+ [
+ new DeviceOption(null, "System default microphone")
+ ];
+ _inputOptions.AddRange(devices
+ .Where(d => d.IsInput)
+ .Select(d => new DeviceOption(d.DeviceId, d.Name)));
+
+ _outputOptions =
+ [
+ new DeviceOption(null, "System default speaker")
+ ];
+ _outputOptions.AddRange(devices
+ .Where(d => d.IsOutput)
+ .Select(d => new DeviceOption(d.DeviceId, d.Name)));
+
+ InputDeviceComboBox.ItemsSource = _inputOptions;
+ OutputDeviceComboBox.ItemsSource = _outputOptions;
+
+ InputDeviceComboBox.SelectedItem = _inputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.InputDeviceId) ?? _inputOptions[0];
+ OutputDeviceComboBox.SelectedItem = _outputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.OutputDeviceId) ?? _outputOptions[0];
+
+ StatusTextBlock.Text = BuildStatusText();
+ }
+ catch (Exception ex)
+ {
+ StatusTextBlock.Text = $"Failed to load devices: {ex.Message}";
+ }
+ }
+
+ private void SelectMode(VoiceActivationMode mode)
+ {
+ var target = mode switch
+ {
+ VoiceActivationMode.WakeWord => "WakeWord",
+ VoiceActivationMode.AlwaysOn => "AlwaysOn",
+ _ => "Off"
+ };
+
+ foreach (var item in ModeComboBox.Items.OfType<ComboBoxItem>())
+ {
+ if (string.Equals(item.Tag?.ToString(), target, StringComparison.Ordinal))
+ {
+ ModeComboBox.SelectedItem = item;
+ return;
+ }
+ }
+
+ ModeComboBox.SelectedIndex = 0;
+ }
+
+ private VoiceActivationMode GetSelectedMode()
+ {
+ var tag = (ModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
+ return tag switch
+ {
+ "WakeWord" => VoiceActivationMode.WakeWord,
+ "AlwaysOn" => VoiceActivationMode.AlwaysOn,
+ _ => VoiceActivationMode.Off
+ };
+ }
+
+ private string BuildStatusText()
+ {
+ var running = _voiceService.CurrentStatus;
+ var runtime = running.Running
+ ? $"{running.Mode} ({running.State})"
+ : "Off";
+ var nodeMode = _settings.EnableNodeMode ? "enabled" : "disabled";
+ var stt = (SpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Recognition";
+ var tts = (TextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Synthesis";
+ var error = string.IsNullOrWhiteSpace(running.LastError)
+ ? string.Empty
+ : $" Last issue: {running.LastError}.";
+ UpdateTroubleshooting(running.LastError);
+ return $"Runtime: {runtime}. Node Mode is {nodeMode}. STT: {stt}. TTS: {tts}.{error}";
+ }
+
+ private void UpdateModeInfo()
+ {
+ var mode = GetSelectedMode();
+ ModeInfoBar.Message = mode switch
+ {
+ VoiceActivationMode.WakeWord => "WakeWord settings are saved now, but NanoWakeWord activation is still the next implementation step.",
+ VoiceActivationMode.AlwaysOn => "AlwaysOn is the first active runtime target. It uses Windows speech recognition and turn-based reply playback today.",
+ _ => "Voice runtime stays off until you choose a listening mode."
+ };
+ }
+
+ private void UpdateProviderInfo()
+ {
+ var stt = SpeechToTextProviderComboBox.SelectedItem as ProviderOption;
+ var tts = TextToSpeechProviderComboBox.SelectedItem as ProviderOption;
+
+ var details = new List<string>();
+ if (stt != null)
+ {
+ details.Add($"STT: {stt.Name}");
+ }
+
+ if (tts != null)
+ {
+ details.Add($"TTS: {tts.Name}");
+ }
+
+ var fallbackNotice = (stt != null && !string.Equals(stt.Id, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase)) ||
+ (tts != null && !string.Equals(tts.Id, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase))
+ ? " Selected non-Windows providers are saved now but will fall back to Windows until their runtime adapters are added."
+ : string.Empty;
+
+ ProviderInfoTextBlock.Text =
+ $"{string.Join(" · ", details)}. Configure extra providers in {VoiceProviderCatalogService.CatalogFilePath}.{fallbackNotice}";
+ }
+
+ private void UpdateTroubleshooting(string? error)
+ {
+ TroubleshootingPanel.Visibility = Visibility.Collapsed;
+ OpenSpeechSettingsButton.Visibility = Visibility.Collapsed;
+ OpenMicrophoneSettingsButton.Visibility = Visibility.Collapsed;
+ TroubleshootingTextBlock.Text = string.Empty;
+
+ if (string.IsNullOrWhiteSpace(error))
+ {
+ return;
+ }
+
+ if (error.Contains("online speech recognition is disabled", StringComparison.OrdinalIgnoreCase))
+ {
+ TroubleshootingPanel.Visibility = Visibility.Visible;
+ OpenSpeechSettingsButton.Visibility = Visibility.Visible;
+ TroubleshootingTextBlock.Text =
+ "To fix this: open Windows Settings, go to Privacy & security > Speech, turn on Online speech recognition, then restart Voice Mode.";
+ return;
+ }
+
+ if (error.Contains("microphone access is blocked", StringComparison.OrdinalIgnoreCase))
+ {
+ TroubleshootingPanel.Visibility = Visibility.Visible;
+ OpenMicrophoneSettingsButton.Visibility = Visibility.Visible;
+ TroubleshootingTextBlock.Text =
+ "To fix this: open Windows Settings, go to Privacy & security > Microphone, allow microphone access and enable desktop app access, then restart Voice Mode.";
+ }
+ }
+
+ private async void OnRefreshDevices(object sender, RoutedEventArgs e)
+ {
+ await LoadDevicesAsync();
+ }
+
+ private void OnModeChanged(object sender, SelectionChangedEventArgs e)
+ {
+ UpdateModeInfo();
+ StatusTextBlock.Text = BuildStatusText();
+ }
+
+ private void OnProviderChanged(object sender, SelectionChangedEventArgs e)
+ {
+ UpdateProviderInfo();
+ StatusTextBlock.Text = BuildStatusText();
+ }
+
+ private void OnOpenSpeechSettings(object sender, RoutedEventArgs e)
+ {
+ OpenSettingsUri("ms-settings:privacy-speech");
+ }
+
+ private void OnOpenMicrophoneSettings(object sender, RoutedEventArgs e)
+ {
+ OpenSettingsUri("ms-settings:privacy-microphone");
+ }
+
+ private async void OnSave(object sender, RoutedEventArgs e)
+ {
+ var updated = new VoiceSettings
+ {
+ Mode = GetSelectedMode(),
+ Enabled = GetSelectedMode() != VoiceActivationMode.Off,
+ SpeechToTextProviderId = (SpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
+ TextToSpeechProviderId = (TextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
+ InputDeviceId = (InputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
+ OutputDeviceId = (OutputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
+ SampleRateHz = _settings.Voice.SampleRateHz,
+ CaptureChunkMs = _settings.Voice.CaptureChunkMs,
+ BargeInEnabled = _settings.Voice.BargeInEnabled,
+ WakeWord = new VoiceWakeWordSettings
+ {
+ Engine = _settings.Voice.WakeWord.Engine,
+ ModelId = _settings.Voice.WakeWord.ModelId,
+ TriggerThreshold = _settings.Voice.WakeWord.TriggerThreshold,
+ TriggerCooldownMs = _settings.Voice.WakeWord.TriggerCooldownMs,
+ PreRollMs = _settings.Voice.WakeWord.PreRollMs,
+ EndSilenceMs = _settings.Voice.WakeWord.EndSilenceMs
+ },
+ AlwaysOn = new VoiceAlwaysOnSettings
+ {
+ MinSpeechMs = _settings.Voice.AlwaysOn.MinSpeechMs,
+ EndSilenceMs = _settings.Voice.AlwaysOn.EndSilenceMs,
+ MaxUtteranceMs = _settings.Voice.AlwaysOn.MaxUtteranceMs,
+ AutoSubmit = _settings.Voice.AlwaysOn.AutoSubmit
+ }
+ };
+
+ try
+ {
+ await _voiceService.UpdateSettingsAsync(new VoiceSettingsUpdateArgs
+ {
+ Settings = updated,
+ Persist = true
+ });
+
+ if (_settings.EnableNodeMode)
+ {
+ if (updated.Mode == VoiceActivationMode.Off)
+ {
+ await _voiceService.StopAsync(new VoiceStopArgs { Reason = "Voice mode disabled by user" });
+ }
+ else
+ {
+ await _voiceService.StartAsync(new VoiceStartArgs { Mode = updated.Mode });
+ }
+ }
+
+ Close();
+ }
+ catch (Exception ex)
+ {
+ StatusTextBlock.Text = $"Failed to save voice settings: {ex.Message}";
+ }
+ }
+
+ private void OnCancel(object sender, RoutedEventArgs e)
+ {
+ Close();
+ }
+
+ private static void OpenSettingsUri(string uri)
+ {
+ try
+ {
+ Process.Start(new ProcessStartInfo(uri) { UseShellExecute = true });
+ }
+ catch
+ {
+ }
+ }
+
+ private sealed record DeviceOption(string? DeviceId, string Name);
+ private sealed record ProviderOption(string Id, string Name, string Runtime, string? Description);
+}
diff --git a/tests/OpenClaw.Shared.Tests/CapabilityTests.cs b/tests/OpenClaw.Shared.Tests/CapabilityTests.cs
index 350cb8d..edcd834 100644
--- a/tests/OpenClaw.Shared.Tests/CapabilityTests.cs
+++ b/tests/OpenClaw.Shared.Tests/CapabilityTests.cs
@@ -931,3 +931,205 @@ public async Task Snap_ReturnsError_WhenHandlerThrows()
Assert.Contains("Camera access blocked", res.Error);
}
}
+
+public class VoiceCapabilityTests
+{
+ private static JsonElement Parse(string json)
+ {
+ using var doc = JsonDocument.Parse(json);
+ return doc.RootElement.Clone();
+ }
+
+ [Fact]
+ public void CanHandle_VoiceCommands()
+ {
+ var cap = new VoiceCapability(NullLogger.Instance);
+ Assert.True(cap.CanHandle(VoiceCommands.ListDevices));
+ Assert.True(cap.CanHandle(VoiceCommands.GetSettings));
+ Assert.True(cap.CanHandle(VoiceCommands.SetSettings));
+ Assert.True(cap.CanHandle(VoiceCommands.GetStatus));
+ Assert.True(cap.CanHandle(VoiceCommands.Start));
+ Assert.True(cap.CanHandle(VoiceCommands.Stop));
+ Assert.False(cap.CanHandle("voice.unknown"));
+ Assert.Equal("voice", cap.Category);
+ }
+
+ [Fact]
+ public async Task ListDevices_ReturnsArrayFromHandler()
+ {
+ var cap = new VoiceCapability(NullLogger.Instance);
+ cap.ListDevicesRequested += () => Task.FromResult<VoiceAudioDeviceInfo[]>(
+ [
+ new VoiceAudioDeviceInfo
+ {
+ DeviceId = "default-input",
+ Name = "System default microphone",
+ IsDefault = true,
+ IsInput = true
+ }
+ ]);
+
+ var res = await cap.ExecuteAsync(new NodeInvokeRequest
+ {
+ Id = "voice1",
+ Command = VoiceCommands.ListDevices,
+ Args = Parse("""{}""")
+ });
+
+ Assert.True(res.Ok);
+ var json = JsonSerializer.Serialize(res.Payload);
+ using var doc = JsonDocument.Parse(json);
+ Assert.Equal(JsonValueKind.Array, doc.RootElement.ValueKind);
+ Assert.Equal("default-input", doc.RootElement[0].GetProperty("DeviceId").GetString());
+ }
+
+ [Fact]
+ public async Task GetSettings_ReturnsSettingsFromHandler()
+ {
+ var cap = new VoiceCapability(NullLogger.Instance);
+ cap.SettingsRequested += () => Task.FromResult(new VoiceSettings
+ {
+ Enabled = true,
+ Mode = VoiceActivationMode.WakeWord
+ });
+
+ var res = await cap.ExecuteAsync(new NodeInvokeRequest
+ {
+ Id = "voice2",
+ Command = VoiceCommands.GetSettings,
+ Args = Parse("""{}""")
+ });
+
+ Assert.True(res.Ok);
+ var json = JsonSerializer.Serialize(res.Payload);
+ using var doc = JsonDocument.Parse(json);
+ Assert.True(doc.RootElement.GetProperty("Enabled").GetBoolean());
+ Assert.Equal("WakeWord", doc.RootElement.GetProperty("Mode").GetString());
+ }
+
+ [Fact]
+ public async Task SetSettings_UsesUpdateEnvelope_WhenPresent()
+ {
+ var cap = new VoiceCapability(NullLogger.Instance);
+ VoiceSettingsUpdateArgs? received = null;
+ cap.SettingsUpdateRequested += update =>
+ {
+ received = update;
+ return Task.FromResult(update.Settings);
+ };
+
+ var res = await cap.ExecuteAsync(new NodeInvokeRequest
+ {
+ Id = "voice3",
+ Command = VoiceCommands.SetSettings,
+ Args = Parse("""{"update":{"persist":false,"settings":{"enabled":true,"mode":"AlwaysOn"}}}""")
+ });
+
+ Assert.True(res.Ok);
+ Assert.NotNull(received);
+ Assert.False(received!.Persist);
+ Assert.Equal(VoiceActivationMode.AlwaysOn, received.Settings.Mode);
+ }
+
+ [Fact]
+ public async Task GetStatus_ReturnsStatusFromHandler()
+ {
+ var cap = new VoiceCapability(NullLogger.Instance);
+ cap.StatusRequested += () => Task.FromResult(new VoiceStatusInfo
+ {
+ Available = true,
+ Running = true,
+ Mode = VoiceActivationMode.AlwaysOn,
+ State = VoiceRuntimeState.ListeningContinuously
+ });
+
+ var res = await cap.ExecuteAsync(new NodeInvokeRequest
+ {
+ Id = "voice4",
+ Command = VoiceCommands.GetStatus,
+ Args = Parse("""{}""")
+ });
+
+ Assert.True(res.Ok);
+ var json = JsonSerializer.Serialize(res.Payload);
+ using var doc = JsonDocument.Parse(json);
+ Assert.True(doc.RootElement.GetProperty("Running").GetBoolean());
+ Assert.Equal("ListeningContinuously", doc.RootElement.GetProperty("State").GetString());
+ }
+
+ [Fact]
+ public async Task Start_PassesArgsToHandler()
+ {
+ var cap = new VoiceCapability(NullLogger.Instance);
+ VoiceStartArgs? received = null;
+ cap.StartRequested += args =>
+ {
+ received = args;
+ return Task.FromResult(new VoiceStatusInfo
+ {
+ Available = true,
+ Running = true,
+ Mode = args.Mode ?? VoiceActivationMode.Off,
+ State = VoiceRuntimeState.ListeningForWakeWord,
+ SessionKey = args.SessionKey
+ });
+ };
+
+ var res = await cap.ExecuteAsync(new NodeInvokeRequest
+ {
+ Id = "voice5",
+ Command = VoiceCommands.Start,
+ Args = Parse("""{"mode":"WakeWord","sessionKey":"session-123"}""")
+ });
+
+ Assert.True(res.Ok);
+ Assert.NotNull(received);
+ Assert.Equal(VoiceActivationMode.WakeWord, received!.Mode);
+ Assert.Equal("session-123", received.SessionKey);
+ }
+
+ [Fact]
+ public async Task Stop_PassesReasonToHandler()
+ {
+ var cap = new VoiceCapability(NullLogger.Instance);
+ VoiceStopArgs? received = null;
+ cap.StopRequested += args =>
+ {
+ received = args;
+ return Task.FromResult(new VoiceStatusInfo
+ {
+ Available = true,
+ Running = false,
+ Mode = VoiceActivationMode.Off,
+ State = VoiceRuntimeState.Stopped,
+ LastError = args.Reason
+ });
+ };
+
+ var res = await cap.ExecuteAsync(new NodeInvokeRequest
+ {
+ Id = "voice6",
+ Command = VoiceCommands.Stop,
+ Args = Parse("""{"reason":"user requested"}""")
+ });
+
+ Assert.True(res.Ok);
+ Assert.NotNull(received);
+ Assert.Equal("user requested", received!.Reason);
+ }
+
+ [Fact]
+ public async Task Start_ReturnsError_WhenHandlerMissing()
+ {
+ var cap = new VoiceCapability(NullLogger.Instance);
+ var res = await cap.ExecuteAsync(new NodeInvokeRequest
+ {
+ Id = "voice7",
+ Command = VoiceCommands.Start,
+ Args = Parse("""{}""")
+ });
+
+ Assert.False(res.Ok);
+ Assert.Contains("not available", res.Error, StringComparison.OrdinalIgnoreCase);
+ }
+}
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
new file mode 100644
index 0000000..3fd1e85
--- /dev/null
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -0,0 +1,77 @@
+using OpenClaw.Shared;
+using System.Text.Json;
+
+namespace OpenClaw.Shared.Tests;
+
+public class VoiceCommandsTests
+{
+ [Fact]
+ public void All_ContainsExpectedCommandsInStableOrder()
+ {
+ Assert.Equal(
+ [
+ "voice.devices.list",
+ "voice.settings.get",
+ "voice.settings.set",
+ "voice.status.get",
+ "voice.start",
+ "voice.stop"
+ ],
+ VoiceCommands.All);
+ }
+}
+
+public class VoiceSchemaDefaultsTests
+{
+ [Fact]
+ public void VoiceSettings_Defaults_AreConcreteAndProviderAgnostic()
+ {
+ var settings = new VoiceSettings();
+
+ Assert.False(settings.Enabled);
+ Assert.Equal(VoiceActivationMode.Off, settings.Mode);
+ Assert.Equal(VoiceProviderIds.Windows, settings.SpeechToTextProviderId);
+ Assert.Equal(VoiceProviderIds.Windows, settings.TextToSpeechProviderId);
+ Assert.Equal(16000, settings.SampleRateHz);
+ Assert.Equal(80, settings.CaptureChunkMs);
+ Assert.True(settings.BargeInEnabled);
+ Assert.Equal("NanoWakeWord", settings.WakeWord.Engine);
+ Assert.Equal("hey_openclaw", settings.WakeWord.ModelId);
+ Assert.Equal(0.65f, settings.WakeWord.TriggerThreshold);
+ Assert.Equal(250, settings.AlwaysOn.MinSpeechMs);
+ Assert.True(settings.AlwaysOn.AutoSubmit);
+ }
+
+ [Fact]
+ public void VoiceStatusInfo_Defaults_ToStopped()
+ {
+ var status = new VoiceStatusInfo();
+
+ Assert.False(status.Available);
+ Assert.False(status.Running);
+ Assert.Equal(VoiceActivationMode.Off, status.Mode);
+ Assert.Equal(VoiceRuntimeState.Stopped, status.State);
+ Assert.False(status.WakeWordLoaded);
+ Assert.Null(status.LastError);
+ }
+
+ [Fact]
+ public void VoiceEnums_Serialize_AsStrings()
+ {
+ var json = JsonSerializer.Serialize(new VoiceStartArgs
+ {
+ Mode = VoiceActivationMode.WakeWord
+ });
+
+ Assert.Contains("\"WakeWord\"", json);
+ }
+
+ [Fact]
+ public void VoiceProviderCatalog_Defaults_ToEmptyLists()
+ {
+ var catalog = new VoiceProviderCatalog();
+
+ Assert.Empty(catalog.SpeechToTextProviders);
+ Assert.Empty(catalog.TextToSpeechProviders);
+ }
+}
diff --git a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
index 8b09519..7533d1b 100644
--- a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
+++ b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
@@ -28,6 +28,34 @@ public void RoundTrip_AllFields_Preserved()
HasSeenActivityStreamTip = true,
NotifyChatResponses = false,
PreferStructuredCategories = true,
+ Voice = new VoiceSettings
+ {
+ Enabled = true,
+ Mode = VoiceActivationMode.WakeWord,
+ SpeechToTextProviderId = "windows",
+ TextToSpeechProviderId = "elevenlabs",
+ InputDeviceId = "mic-1",
+ OutputDeviceId = "spk-2",
+ SampleRateHz = 16000,
+ CaptureChunkMs = 80,
+ BargeInEnabled = false,
+ WakeWord = new VoiceWakeWordSettings
+ {
+ Engine = "NanoWakeWord",
+ ModelId = "hey_openclaw",
+ TriggerThreshold = 0.72f,
+ TriggerCooldownMs = 2500,
+ PreRollMs = 1400,
+ EndSilenceMs = 1000
+ },
+ AlwaysOn = new VoiceAlwaysOnSettings
+ {
+ MinSpeechMs = 300,
+ EndSilenceMs = 1100,
+ MaxUtteranceMs = 18000,
+ AutoSubmit = false
+ }
+ },
UserRules = new List<UserNotificationRule>
{
new() { Pattern = "build.*fail", IsRegex = true, Category = "urgent", Enabled = true }
@@ -56,6 +84,18 @@ public void RoundTrip_AllFields_Preserved()
Assert.Equal(original.HasSeenActivityStreamTip, restored.HasSeenActivityStreamTip);
Assert.Equal(original.NotifyChatResponses, restored.NotifyChatResponses);
Assert.Equal(original.PreferStructuredCategories, restored.PreferStructuredCategories);
+ Assert.NotNull(restored.Voice);
+ Assert.True(restored.Voice.Enabled);
+ Assert.Equal(VoiceActivationMode.WakeWord, restored.Voice.Mode);
+ Assert.Equal("windows", restored.Voice.SpeechToTextProviderId);
+ Assert.Equal("elevenlabs", restored.Voice.TextToSpeechProviderId);
+ Assert.Equal("mic-1", restored.Voice.InputDeviceId);
+ Assert.Equal("spk-2", restored.Voice.OutputDeviceId);
+ Assert.Equal("NanoWakeWord", restored.Voice.WakeWord.Engine);
+ Assert.Equal("hey_openclaw", restored.Voice.WakeWord.ModelId);
+ Assert.Equal(0.72f, restored.Voice.WakeWord.TriggerThreshold);
+ Assert.Equal(300, restored.Voice.AlwaysOn.MinSpeechMs);
+ Assert.False(restored.Voice.AlwaysOn.AutoSubmit);
Assert.NotNull(restored.UserRules);
Assert.Single(restored.UserRules);
Assert.Equal("build.*fail", restored.UserRules[0].Pattern);
@@ -101,6 +141,13 @@ public void MissingFields_UseDefaults()
Assert.False(settings.HasSeenActivityStreamTip);
Assert.True(settings.NotifyChatResponses);
Assert.True(settings.PreferStructuredCategories);
+ Assert.NotNull(settings.Voice);
+ Assert.False(settings.Voice.Enabled);
+ Assert.Equal(VoiceActivationMode.Off, settings.Voice.Mode);
+ Assert.Equal(VoiceProviderIds.Windows, settings.Voice.SpeechToTextProviderId);
+ Assert.Equal(VoiceProviderIds.Windows, settings.Voice.TextToSpeechProviderId);
+ Assert.Equal(16000, settings.Voice.SampleRateHz);
+ Assert.Equal("NanoWakeWord", settings.Voice.WakeWord.Engine);
Assert.Null(settings.UserRules);
}
@@ -137,6 +184,11 @@ public void BackwardCompatibility_OldSettingsWithoutNewFields()
Assert.False(settings.EnableNodeMode);
Assert.False(settings.HasSeenActivityStreamTip);
Assert.True(settings.GlobalHotkeyEnabled);
+ Assert.NotNull(settings.Voice);
+ Assert.False(settings.Voice.Enabled);
+ Assert.Equal(VoiceActivationMode.Off, settings.Voice.Mode);
+ Assert.Equal(VoiceProviderIds.Windows, settings.Voice.SpeechToTextProviderId);
+ Assert.Equal(VoiceProviderIds.Windows, settings.Voice.TextToSpeechProviderId);
Assert.Null(settings.UserRules);
}
From f40ffc34504c7d99a087c511dacb1a68b5467b7c Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 00:25:23 +0000
Subject: [PATCH 02/83] Fix voice chat transport and reply routing
---
src/OpenClaw.Shared/Models.cs | 8 +
src/OpenClaw.Shared/OpenClawGatewayClient.cs | 288 +++++++++++++++---
.../OpenClawGatewayClientTests.cs | 230 ++++++++++++++
3 files changed, 489 insertions(+), 37 deletions(-)
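
The session-key normalization added by this patch can be exercised on its own. The sketch below copies the rule from `NormalizeChatSessionKey` into a standalone program; `SessionKeyDemo` is a hypothetical name (the real method is private to `OpenClawGatewayClient`), and the sample key `agent:main:sub` is an assumption chosen to probe the `":main:"` substring check, not a key taken from the gateway.

```csharp
using System;

// Standalone copy of the normalization rule from this patch, for illustration only.
// SessionKeyDemo is a hypothetical wrapper, not a type in the repository.
public static class SessionKeyDemo
{
    private const string DefaultChatSessionKey = "main";

    public static string Normalize(string? sessionKey)
    {
        // Missing or blank keys fall back to the default session.
        if (string.IsNullOrWhiteSpace(sessionKey))
        {
            return DefaultChatSessionKey;
        }

        // Same condition as the patch: the literal "main", or any key containing ":main:".
        return sessionKey == "main" || sessionKey.Contains(":main:", StringComparison.Ordinal)
            ? DefaultChatSessionKey
            : sessionKey;
    }

    public static void Main()
    {
        Console.WriteLine(Normalize(null));               // main
        Console.WriteLine(Normalize("agent:main:main"));  // main
        Console.WriteLine(Normalize("agent:sub:test"));   // agent:sub:test
        Console.WriteLine(Normalize("agent:main:sub"));   // main (assumed key, see note)
    }
}
```

As written, the rule collapses every key that contains `:main:`, so a key shaped like `agent:main:sub` would also map to `main`. Whether that is intended depends on the gateway's key format, which this patch does not spell out.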
diff --git a/src/OpenClaw.Shared/Models.cs b/src/OpenClaw.Shared/Models.cs
index 725a66b..5b56953 100644
--- a/src/OpenClaw.Shared/Models.cs
+++ b/src/OpenClaw.Shared/Models.cs
@@ -86,6 +86,14 @@ public class OpenClawNotification
public string[]? Tags { get; set; } // free-form routing tags
}
+public class ChatMessageEventArgs : EventArgs
+{
+ public string SessionKey { get; set; } = "main";
+ public string Role { get; set; } = "";
+ public string Message { get; set; } = "";
+ public bool IsFinal { get; set; }
+}
+
/// <summary>
/// A user-defined notification categorization rule.
/// </summary>
diff --git a/src/OpenClaw.Shared/OpenClawGatewayClient.cs b/src/OpenClaw.Shared/OpenClawGatewayClient.cs
index 0e21836..45ce080 100644
--- a/src/OpenClaw.Shared/OpenClawGatewayClient.cs
+++ b/src/OpenClaw.Shared/OpenClawGatewayClient.cs
@@ -15,13 +15,18 @@ public class OpenClawGatewayClient : WebSocketClientBase
private GatewayUsageStatusInfo? _usageStatus;
private GatewayCostUsageInfo? _usageCost;
private readonly Dictionary<string, string> _pendingRequestMethods = new();
+ private readonly HashSet<string> _pendingChatPreviewSessionKeys = new();
private readonly object _pendingRequestLock = new();
+ private readonly object _pendingChatPreviewLock = new();
private readonly object _sessionsLock = new();
private readonly object _nodesLock = new();
private bool _usageStatusUnsupported;
private bool _usageCostUnsupported;
private bool _sessionPreviewUnsupported;
private bool _nodeListUnsupported;
+ private string _defaultChatSessionKey = DefaultChatSessionKey;
+
+ private const string DefaultChatSessionKey = "main";
private void ResetUnsupportedMethodFlags()
{
@@ -49,15 +54,18 @@ protected override Task OnConnectedAsync()
protected override void OnDisconnected()
{
ClearPendingRequests();
+ ClearPendingChatPreviewSessions();
}
protected override void OnDisposing()
{
ClearPendingRequests();
+ ClearPendingChatPreviewSessions();
}
// Events
public event EventHandler<OpenClawNotification>? NotificationReceived;
+ public event EventHandler<ChatMessageEventArgs>? ChatMessageReceived;
public event EventHandler<AgentActivity>? ActivityChanged;
public event EventHandler<ChannelHealth[]>? ChannelHealthUpdated;
public event EventHandler<SessionInfo[]>? SessionsUpdated;
@@ -118,19 +126,29 @@ public async Task CheckHealthAsync()
}
}
- public async Task SendChatMessageAsync(string message)
+ public async Task SendChatMessageAsync(string message, string? sessionKey = null, string? idempotencyKey = null)
{
if (!IsConnected)
throw new InvalidOperationException("Gateway connection is not open");
- var req = new
+ var requestId = Guid.NewGuid().ToString();
+ var resolvedSessionKey = ResolveChatSessionKey(sessionKey);
+ var resolvedIdempotencyKey = string.IsNullOrWhiteSpace(idempotencyKey)
+ ? Guid.NewGuid().ToString()
+ : idempotencyKey;
+ var parameters = BuildChatSendParameters(message, resolvedSessionKey, resolvedIdempotencyKey);
+
+ TrackPendingRequest(requestId, "chat.send");
+ try
{
- type = "req",
- id = Guid.NewGuid().ToString(),
- method = "chat.send",
- @params = new { message }
- };
- await SendRawAsync(JsonSerializer.Serialize(req));
+ await SendRawAsync(SerializeRequest(requestId, "chat.send", parameters));
+ }
+ catch
+ {
+ RemovePendingRequest(requestId);
+ throw;
+ }
+
_logger.Info($"Sent chat message ({message.Length} chars)");
}
@@ -281,37 +299,41 @@ public async Task<bool> StopChannelAsync(string channelName)
private async Task SendConnectMessageAsync(string? nonce = null)
{
- // Use "cli" client ID for native apps - no browser security checks
var msg = new
{
type = "req",
id = Guid.NewGuid().ToString(),
method = "connect",
- @params = new
- {
- minProtocol = 3,
- maxProtocol = 3,
- client = new
- {
- id = "cli", // Native client ID
- version = "1.0.0",
- platform = "windows",
- mode = "cli",
- displayName = "OpenClaw Windows Tray"
- },
- role = "operator",
- scopes = new[] { "operator.admin", "operator.approvals", "operator.pairing" },
- caps = Array.Empty<string>(),
- commands = Array.Empty<string>(),
- permissions = new { },
- auth = new { token = _token },
- locale = "en-US",
- userAgent = "openclaw-windows-tray/1.0.0"
- }
+ @params = BuildConnectParameters()
};
await SendRawAsync(JsonSerializer.Serialize(msg));
}
+ private object BuildConnectParameters()
+ {
+ return new
+ {
+ minProtocol = 3,
+ maxProtocol = 3,
+ client = new
+ {
+ id = "cli",
+ version = "1.0.0",
+ platform = "windows",
+ mode = "cli",
+ displayName = "OpenClaw Windows Tray"
+ },
+ role = "operator",
+ scopes = new[] { "operator.read", "operator.write", "operator.admin", "operator.approvals", "operator.pairing" },
+ caps = Array.Empty<string>(),
+ commands = Array.Empty<string>(),
+ permissions = new { },
+ auth = new { token = _token },
+ locale = "en-US",
+ userAgent = "openclaw-windows-tray/1.0.0"
+ };
+ }
+
private async Task SendTrackedRequestAsync(string method, object? parameters = null)
{
if (!IsConnected) return;
@@ -456,6 +478,7 @@ private void HandleResponse(JsonElement root)
// Handle hello-ok
if (payload.TryGetProperty("type", out var t) && t.GetString() == "hello-ok")
{
+ UpdateDefaultChatSessionKeyFromHello(payload);
_logger.Info("Handshake complete (hello-ok)");
RaiseStatusChanged(ConnectionStatus.Connected);
@@ -799,13 +822,18 @@ private void HandleToolEvent(JsonElement payload, string sessionKey, bool isMain
private void HandleChatEvent(JsonElement root)
{
_logger.Debug($"Chat event received: {root.GetRawText().Substring(0, Math.Min(200, root.GetRawText().Length))}");
-
+
if (!root.TryGetProperty("payload", out var payload)) return;
+ var sessionKey = NormalizeChatSessionKey(TryGetSessionKey(root, payload));
+ var isFinal = !payload.TryGetProperty("state", out var state) ||
+ string.Equals(state.GetString(), "final", StringComparison.OrdinalIgnoreCase);
+ var emittedAssistantText = false;
// Try new format: payload.message.role + payload.message.content[].text
if (payload.TryGetProperty("message", out var message))
{
- if (message.TryGetProperty("role", out var role) && role.GetString() == "assistant")
+ var roleName = GetString(message, "role");
+ if (roleName == "assistant")
{
// Extract text from content array
if (message.TryGetProperty("content", out var content) && content.ValueKind == JsonValueKind.Array)
@@ -816,11 +844,11 @@ private void HandleChatEvent(JsonElement root)
item.TryGetProperty("text", out var textProp))
{
var text = textProp.GetString() ?? "";
- if (!string.IsNullOrEmpty(text) &&
- payload.TryGetProperty("state", out var state) &&
- state.GetString() == "final")
+ if (!string.IsNullOrEmpty(text) && isFinal)
{
+ emittedAssistantText = true;
_logger.Info($"Assistant response: {text.Substring(0, Math.Min(100, text.Length))}...");
+ EmitChatMessage(sessionKey, roleName ?? "assistant", text, isFinal);
EmitChatNotification(text);
}
}
@@ -833,14 +861,32 @@ private void HandleChatEvent(JsonElement root)
else if (payload.TryGetProperty("text", out var textProp))
{
var text = textProp.GetString() ?? "";
- if (payload.TryGetProperty("role", out var role) &&
- role.GetString() == "assistant" &&
+ var roleName = GetString(payload, "role");
+ if (roleName == "assistant" &&
!string.IsNullOrEmpty(text))
{
+ emittedAssistantText = true;
_logger.Info($"Assistant response (legacy): {text.Substring(0, Math.Min(100, text.Length))}");
+ EmitChatMessage(sessionKey, roleName, text, isFinal: true);
EmitChatNotification(text);
}
}
+
+ if (isFinal && !emittedAssistantText)
+ {
+ RequestChatPreviewForFinalState(sessionKey);
+ }
+ }
+
+ private void EmitChatMessage(string sessionKey, string role, string text, bool isFinal)
+ {
+ ChatMessageReceived?.Invoke(this, new ChatMessageEventArgs
+ {
+ SessionKey = sessionKey,
+ Role = role,
+ Message = text,
+ IsFinal = isFinal
+ });
}
private void EmitChatNotification(string text)
@@ -1053,6 +1099,7 @@ private void ParseSessions(JsonElement sessions)
}
snapshot = GetSessionListInternal();
+ UpdateDefaultChatSessionKeyFromSessions();
}
SessionsUpdated?.Invoke(this, snapshot);
@@ -1081,6 +1128,172 @@ private void ParseSessionItem(JsonElement item)
PopulateSessionFromObject(session, item);
_sessions[session.Key] = session;
+ if (session.IsMain)
+ {
+ UpdateDefaultChatSessionKey(session.Key);
+ }
+ }
+
+ private object BuildChatSendParameters(string message, string sessionKey, string idempotencyKey)
+ {
+ return new
+ {
+ message,
+ sessionKey,
+ idempotencyKey
+ };
+ }
+
+ private string ResolveChatSessionKey(string? sessionKey)
+ {
+ if (!string.IsNullOrWhiteSpace(sessionKey))
+ {
+ return NormalizeChatSessionKey(sessionKey);
+ }
+
+ return string.IsNullOrWhiteSpace(_defaultChatSessionKey)
+ ? DefaultChatSessionKey
+ : _defaultChatSessionKey;
+ }
+
+ private void UpdateDefaultChatSessionKeyFromHello(JsonElement payload)
+ {
+ if (!payload.TryGetProperty("snapshot", out var snapshot) ||
+ snapshot.ValueKind != JsonValueKind.Object ||
+ !snapshot.TryGetProperty("sessionDefaults", out var sessionDefaults) ||
+ sessionDefaults.ValueKind != JsonValueKind.Object)
+ {
+ return;
+ }
+
+ var mainSessionKey = GetString(sessionDefaults, "mainKey") ??
+ GetString(sessionDefaults, "mainSessionKey");
+ if (!string.IsNullOrWhiteSpace(mainSessionKey))
+ {
+ UpdateDefaultChatSessionKey(mainSessionKey);
+ }
+ }
+
+ private void UpdateDefaultChatSessionKeyFromSessions()
+ {
+ var mainSession = _sessions.Values.FirstOrDefault(s => s.IsMain && !string.IsNullOrWhiteSpace(s.Key));
+ if (!string.IsNullOrWhiteSpace(mainSession?.Key))
+ {
+ UpdateDefaultChatSessionKey(mainSession.Key);
+ }
+ }
+
+ private void UpdateDefaultChatSessionKey(string? sessionKey)
+ {
+ if (!string.IsNullOrWhiteSpace(sessionKey))
+ {
+ _defaultChatSessionKey = NormalizeChatSessionKey(sessionKey);
+ }
+ }
+
+ private void RequestChatPreviewForFinalState(string sessionKey)
+ {
+ if (string.IsNullOrWhiteSpace(sessionKey) || _sessionPreviewUnsupported)
+ {
+ return;
+ }
+
+ var normalizedSessionKey = NormalizeChatSessionKey(sessionKey);
+ lock (_pendingChatPreviewLock)
+ {
+ if (!_pendingChatPreviewSessionKeys.Add(normalizedSessionKey))
+ {
+ return;
+ }
+ }
+
+ _ = Task.Run(async () =>
+ {
+ try
+ {
+ await RequestSessionPreviewAsync([normalizedSessionKey], limit: 2, maxChars: 4000);
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"sessions.preview request failed for {normalizedSessionKey}: {ex.Message}");
+ lock (_pendingChatPreviewLock)
+ {
+ _pendingChatPreviewSessionKeys.Remove(normalizedSessionKey);
+ }
+ }
+ });
+ }
+
+ private void EmitPendingChatPreviewMessages(SessionsPreviewPayloadInfo payload)
+ {
+ foreach (var preview in payload.Previews)
+ {
+ var normalizedSessionKey = NormalizeChatSessionKey(preview.Key);
+ var shouldEmit = false;
+
+ lock (_pendingChatPreviewLock)
+ {
+ if (_pendingChatPreviewSessionKeys.Remove(normalizedSessionKey))
+ {
+ shouldEmit = true;
+ }
+ }
+
+ if (!shouldEmit)
+ {
+ continue;
+ }
+
+ var assistantText = preview.Items
+ .LastOrDefault(item => string.Equals(item.Role, "assistant", StringComparison.OrdinalIgnoreCase))?
+ .Text?
+ .Trim();
+
+ if (string.IsNullOrWhiteSpace(assistantText))
+ {
+ continue;
+ }
+
+ _logger.Info($"Assistant response (preview): {assistantText.Substring(0, Math.Min(100, assistantText.Length))}...");
+ EmitChatMessage(normalizedSessionKey, "assistant", assistantText, isFinal: true);
+ EmitChatNotification(assistantText);
+ }
+ }
+
+ private void ClearPendingChatPreviewSessions()
+ {
+ lock (_pendingChatPreviewLock)
+ {
+ _pendingChatPreviewSessionKeys.Clear();
+ }
+ }
+
+ private static string NormalizeChatSessionKey(string? sessionKey)
+ {
+ if (string.IsNullOrWhiteSpace(sessionKey))
+ {
+ return DefaultChatSessionKey;
+ }
+
+ return sessionKey == "main" || sessionKey.Contains(":main:", StringComparison.Ordinal)
+ ? DefaultChatSessionKey
+ : sessionKey;
+ }
+
+ private static string? TryGetSessionKey(JsonElement root, JsonElement payload)
+ {
+ if (root.TryGetProperty("sessionKey", out var rootSessionKey))
+ {
+ return rootSessionKey.GetString();
+ }
+
+ if (payload.ValueKind == JsonValueKind.Object &&
+ payload.TryGetProperty("sessionKey", out var payloadSessionKey))
+ {
+ return payloadSessionKey.GetString();
+ }
+
+ return null;
}
private void PopulateSessionFromObject(SessionInfo session, JsonElement item)
@@ -1394,6 +1607,7 @@ private void ParseSessionsPreview(JsonElement payload)
}
SessionPreviewUpdated?.Invoke(this, previewPayload);
+ EmitPendingChatPreviewMessages(previewPayload);
}
catch (Exception ex)
{
diff --git a/tests/OpenClaw.Shared.Tests/OpenClawGatewayClientTests.cs b/tests/OpenClaw.Shared.Tests/OpenClawGatewayClientTests.cs
index 424182d..d43f743 100644
--- a/tests/OpenClaw.Shared.Tests/OpenClawGatewayClientTests.cs
+++ b/tests/OpenClaw.Shared.Tests/OpenClawGatewayClientTests.cs
@@ -66,6 +66,54 @@ public SessionInfo[] GetSessionList()
return _client.GetSessionList();
}
+ public string GetDefaultChatSessionKey()
+ {
+ return GetPrivateField<string>("_defaultChatSessionKey");
+ }
+
+ public void UpdateDefaultChatSessionKeyFromHello(string payloadJson)
+ {
+ using var doc = JsonDocument.Parse(payloadJson);
+ var method = typeof(OpenClawGatewayClient).GetMethod(
+ "UpdateDefaultChatSessionKeyFromHello",
+ System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
+ method!.Invoke(_client, new object[] { doc.RootElement.Clone() });
+ }
+
+ public string SerializeChatSendRequest(string message, string sessionKey, string idempotencyKey)
+ {
+ var parametersMethod = typeof(OpenClawGatewayClient).GetMethod(
+ "BuildChatSendParameters",
+ System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
+ var parameters = parametersMethod!.Invoke(_client, new object[] { message, sessionKey, idempotencyKey });
+
+ var serializeMethod = typeof(OpenClawGatewayClient).GetMethod(
+ "SerializeRequest",
+ System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
+ return (string)serializeMethod!.Invoke(null, new object[] { "request-123", "chat.send", parameters! })!;
+ }
+
+ public string SerializeConnectRequest()
+ {
+ var parametersMethod = typeof(OpenClawGatewayClient).GetMethod(
+ "BuildConnectParameters",
+ System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
+ var parameters = parametersMethod!.Invoke(_client, Array.Empty<object>());
+
+ var serializeMethod = typeof(OpenClawGatewayClient).GetMethod(
+ "SerializeRequest",
+ System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
+ return (string)serializeMethod!.Invoke(null, new object[] { "request-456", "connect", parameters! })!;
+ }
+
+ public string NormalizeChatSessionKey(string? sessionKey)
+ {
+ var method = typeof(OpenClawGatewayClient).GetMethod(
+ "NormalizeChatSessionKey",
+ System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
+ return (string)method!.Invoke(null, new object?[] { sessionKey })!;
+ }
+
public void SetUnsupportedMethodFlags(bool usageStatus, bool usageCost, bool sessionPreview, bool nodeList)
{
SetPrivateField("_usageStatusUnsupported", usageStatus);
@@ -122,6 +170,58 @@ public SessionsPreviewPayloadInfo ParseSessionsPreviewPayload(string payloadJson
return parsed ?? new SessionsPreviewPayloadInfo();
}
+ public ChatMessageEventArgs? HandleChatEventAndCaptureMessage(string payloadJson)
+ {
+ ChatMessageEventArgs? captured = null;
+ EventHandler<ChatMessageEventArgs> handler = (_, args) => captured = args;
+ _client.ChatMessageReceived += handler;
+
+ try
+ {
+ using var doc = JsonDocument.Parse(payloadJson);
+ var method = typeof(OpenClawGatewayClient).GetMethod(
+ "HandleChatEvent",
+ System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
+ method!.Invoke(_client, new object[] { doc.RootElement.Clone() });
+ }
+ finally
+ {
+ _client.ChatMessageReceived -= handler;
+ }
+
+ return captured;
+ }
+
+ public int GetPendingChatPreviewSessionCount()
+ {
+ var pending = GetPrivateField<HashSet<string>>("_pendingChatPreviewSessionKeys");
+ return pending.Count;
+ }
+
+ public void AddPendingChatPreviewSession(string sessionKey)
+ {
+ var pending = GetPrivateField<HashSet<string>>("_pendingChatPreviewSessionKeys");
+ pending.Add(sessionKey);
+ }
+
+ public ChatMessageEventArgs? ParseSessionsPreviewPayloadAndCaptureMessage(string payloadJson)
+ {
+ ChatMessageEventArgs? captured = null;
+ EventHandler<ChatMessageEventArgs> handler = (_, args) => captured = args;
+ _client.ChatMessageReceived += handler;
+
+ try
+ {
+ InvokePrivatePayloadParser("ParseSessionsPreview", payloadJson);
+ }
+ finally
+ {
+ _client.ChatMessageReceived -= handler;
+ }
+
+ return captured;
+ }
+
public GatewayNodeInfo[] ParseNodeListPayload(string payloadJson)
{
GatewayNodeInfo[] parsed = Array.Empty<GatewayNodeInfo>();
@@ -670,4 +770,134 @@ public void ParseChannelHealth_StatusField_TakesPriorityOverDerivedStatus()
Assert.Single(channels);
Assert.Equal("degraded", channels[0].Status);
}
+
+ [Fact]
+ public void UpdateDefaultChatSessionKeyFromHello_UsesSnapshotMainSessionKey()
+ {
+ var helper = new GatewayClientTestHelper();
+
+ helper.UpdateDefaultChatSessionKeyFromHello("""
+ {
+ "type": "hello-ok",
+ "snapshot": {
+ "sessionDefaults": {
+ "mainSessionKey": "agent:main:main"
+ }
+ }
+ }
+ """);
+
+ Assert.Equal("main", helper.GetDefaultChatSessionKey());
+ }
+
+ [Fact]
+ public void ParseSessions_MainSession_UpdatesDefaultChatSessionKey()
+ {
+ var helper = new GatewayClientTestHelper();
+
+ helper.ParseSessionsPayload("""
+ {
+ "agent:main:main": {
+ "status": "active",
+ "displayName": "Main",
+ "isMain": true
+ },
+ "agent:other:test": {
+ "status": "active"
+ }
+ }
+ """);
+
+ Assert.Equal("main", helper.GetDefaultChatSessionKey());
+ }
+
+ [Fact]
+ public void SerializeChatSendRequest_IncludesSessionKeyAndIdempotencyKey()
+ {
+ var helper = new GatewayClientTestHelper();
+
+ var json = helper.SerializeChatSendRequest("hello", "main", "idem-123");
+ using var doc = JsonDocument.Parse(json);
+ var parameters = doc.RootElement.GetProperty("params");
+
+ Assert.Equal("hello", parameters.GetProperty("message").GetString());
+ Assert.Equal("main", parameters.GetProperty("sessionKey").GetString());
+ Assert.Equal("idem-123", parameters.GetProperty("idempotencyKey").GetString());
+ }
+
+ [Fact]
+ public void NormalizeChatSessionKey_CollapsesExpandedMainKey()
+ {
+ var helper = new GatewayClientTestHelper();
+
+ Assert.Equal("main", helper.NormalizeChatSessionKey("agent:main:main"));
+ Assert.Equal("main", helper.NormalizeChatSessionKey("main"));
+ Assert.Equal("agent:sub:test", helper.NormalizeChatSessionKey("agent:sub:test"));
+ }
+
+ [Fact]
+ public void HandleChatEvent_FinalWithoutMessage_QueuesPreviewLookup()
+ {
+ var helper = new GatewayClientTestHelper();
+
+ var captured = helper.HandleChatEventAndCaptureMessage("""
+ {
+ "type": "event",
+ "event": "chat",
+ "payload": {
+ "sessionKey": "agent:main:main",
+ "state": "final"
+ }
+ }
+ """);
+
+ Assert.Null(captured);
+ Assert.Equal(1, helper.GetPendingChatPreviewSessionCount());
+ }
+
+ [Fact]
+ public void ParseSessionsPreview_EmitsAssistantMessage_ForQueuedFinalPreview()
+ {
+ var helper = new GatewayClientTestHelper();
+ helper.AddPendingChatPreviewSession("main");
+
+ var captured = helper.ParseSessionsPreviewPayloadAndCaptureMessage("""
+ {
+ "ts": 1739760000000,
+ "previews": [
+ {
+ "key": "agent:main:main",
+ "status": "ok",
+ "items": [
+ { "role": "user", "text": "hello" },
+ { "role": "assistant", "text": "world" }
+ ]
+ }
+ ]
+ }
+ """);
+
+ Assert.NotNull(captured);
+ Assert.Equal("main", captured!.SessionKey);
+ Assert.Equal("assistant", captured.Role);
+ Assert.Equal("world", captured.Message);
+ Assert.True(captured.IsFinal);
+ Assert.Equal(0, helper.GetPendingChatPreviewSessionCount());
+ }
+
+ [Fact]
+ public void SerializeConnectRequest_UsesCliClientModeAndOperatorScopes()
+ {
+ var helper = new GatewayClientTestHelper();
+
+ var json = helper.SerializeConnectRequest();
+ using var doc = JsonDocument.Parse(json);
+ var parameters = doc.RootElement.GetProperty("params");
+ var client = parameters.GetProperty("client");
+ var scopes = parameters.GetProperty("scopes").EnumerateArray().Select(item => item.GetString()).ToArray();
+
+ Assert.Equal("cli", client.GetProperty("mode").GetString());
+ Assert.Contains("operator.read", scopes);
+ Assert.Contains("operator.write", scopes);
+ }
}
From a81d31ea706b6d065f615e1e223fb90d89c5610f Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 00:25:51 +0000
Subject: [PATCH 03/83] Add configurable voice mode settings and setup UI
---
src/OpenClaw.Shared/VoiceModeSchema.cs | 11 +++++
.../Windows/VoiceModeWindow.xaml | 18 ++++++++
.../Windows/VoiceModeWindow.xaml.cs | 41 +++++++++++++++++--
.../VoiceModeSchemaTests.cs | 2 +
.../SettingsRoundTripTests.cs | 8 +++-
5 files changed, 76 insertions(+), 4 deletions(-)
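
The new `VoiceChatWindowSubmitMode` enum is annotated with `JsonStringEnumConverter<T>`, and the round-trip tests rely on older settings files still loading. A minimal sketch of both behaviours, assuming .NET 8 and default `System.Text.Json` options: `SubmitMode` and `AlwaysOnDemo` are hypothetical stand-ins that mirror the shape of the patched types, not the real `VoiceAlwaysOnSettings`.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical stand-in for VoiceChatWindowSubmitMode; same attribute pattern as the patch.
[JsonConverter(typeof(JsonStringEnumConverter<SubmitMode>))]
public enum SubmitMode
{
    AutoSend,
    WaitForUser
}

// Hypothetical stand-in for the settings object that gains the new property.
public sealed class AlwaysOnDemo
{
    public bool AutoSubmit { get; set; } = true;
    public SubmitMode ChatWindowSubmitMode { get; set; } = SubmitMode.AutoSend;
}

public static class SubmitModeDemo
{
    public static void Main()
    {
        // The enum is written by name rather than as a number.
        var json = JsonSerializer.Serialize(
            new AlwaysOnDemo { ChatWindowSubmitMode = SubmitMode.WaitForUser });
        Console.WriteLine(json);

        // A file written before the field existed omits it, so the
        // property initializer (AutoSend) is what the caller sees.
        var old = JsonSerializer.Deserialize<AlwaysOnDemo>("{\"AutoSubmit\":false}")!;
        Console.WriteLine(old.ChatWindowSubmitMode);
    }
}
```

Storing the enum by name keeps the settings file readable and avoids silently changing meaning if members are reordered later. The property initializer is what gives older files the `AutoSend` default.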
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index 0dce2a5..c160f5e 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -37,21 +37,31 @@ public enum VoiceActivationMode
public enum VoiceRuntimeState
{
Stopped,
+ Paused,
Idle,
Arming,
ListeningForWakeWord,
ListeningContinuously,
RecordingUtterance,
SubmittingAudio,
+ PendingManualSend,
AwaitingResponse,
PlayingResponse,
Error
}
+[JsonConverter(typeof(JsonStringEnumConverter<VoiceChatWindowSubmitMode>))]
+public enum VoiceChatWindowSubmitMode
+{
+ AutoSend,
+ WaitForUser
+}
+
public sealed class VoiceSettings
{
public VoiceActivationMode Mode { get; set; } = VoiceActivationMode.Off;
public bool Enabled { get; set; }
+ public bool ShowConversationToasts { get; set; }
public string SpeechToTextProviderId { get; set; } = VoiceProviderIds.Windows;
public string TextToSpeechProviderId { get; set; } = VoiceProviderIds.Windows;
public string? InputDeviceId { get; set; }
@@ -79,6 +89,7 @@ public sealed class VoiceAlwaysOnSettings
public int EndSilenceMs { get; set; } = 900;
public int MaxUtteranceMs { get; set; } = 15000;
public bool AutoSubmit { get; set; } = true;
+ public VoiceChatWindowSubmitMode ChatWindowSubmitMode { get; set; } = VoiceChatWindowSubmitMode.AutoSend;
}
public sealed class VoiceAudioDeviceInfo
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
index 57cb962..01b8ec3 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
@@ -54,6 +54,24 @@
<Button Content="Refresh devices" HorizontalAlignment="Left" Click="OnRefreshDevices"/>
</StackPanel>
+ <StackPanel Spacing="8">
+ <TextBlock Text="NOTIFICATIONS" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
+ <CheckBox x:Name="VoiceConversationToastsCheckBox"
+ Content="Show voice transcripts and replies as toasts"/>
+ </StackPanel>
+
+ <StackPanel Spacing="8">
+ <TextBlock Text="CHAT WINDOW" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
+ <ComboBox x:Name="ChatWindowSubmitModeComboBox" Header="When the tray chat window is open">
+ <ComboBoxItem Content="Send automatically" Tag="AutoSend"/>
+ <ComboBoxItem Content="Fill message box and wait for me to send" Tag="WaitForUser"/>
+ </ComboBox>
+ <TextBlock Text="This setting only applies while the tray chat window is open. In windowless voice mode, utterances are always sent automatically."
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ TextWrapping="Wrap"/>
+ </StackPanel>
+
<TextBlock x:Name="StatusTextBlock"
Style="{StaticResource CaptionTextBlockStyle}"
Foreground="{ThemeResource TextFillColorSecondaryBrush}"
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
index 8137732..ade1a83 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
@@ -45,6 +45,8 @@ private void LoadSettings()
{
LoadProviders();
SelectMode(_settings.Voice.Mode);
+ SelectChatWindowSubmitMode(_settings.Voice.AlwaysOn.ChatWindowSubmitMode);
+ VoiceConversationToastsCheckBox.IsChecked = _settings.Voice.ShowConversationToasts;
UpdateModeInfo();
UpdateProviderInfo();
StatusTextBlock.Text = BuildStatusText();
@@ -141,12 +143,38 @@ private VoiceActivationMode GetSelectedMode()
};
}
+ private void SelectChatWindowSubmitMode(VoiceChatWindowSubmitMode mode)
+ {
+ var target = mode == VoiceChatWindowSubmitMode.WaitForUser ? "WaitForUser" : "AutoSend";
+
+ foreach (var item in ChatWindowSubmitModeComboBox.Items.OfType<ComboBoxItem>())
+ {
+ if (string.Equals(item.Tag?.ToString(), target, StringComparison.Ordinal))
+ {
+ ChatWindowSubmitModeComboBox.SelectedItem = item;
+ return;
+ }
+ }
+
+ ChatWindowSubmitModeComboBox.SelectedIndex = 0;
+ }
+
+ private VoiceChatWindowSubmitMode GetSelectedChatWindowSubmitMode()
+ {
+ var tag = (ChatWindowSubmitModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
+ return tag == "WaitForUser"
+ ? VoiceChatWindowSubmitMode.WaitForUser
+ : VoiceChatWindowSubmitMode.AutoSend;
+ }
+
private string BuildStatusText()
{
var running = _voiceService.CurrentStatus;
- var runtime = running.Running
+ var runtime = running.State == VoiceRuntimeState.Paused
? $"{running.Mode} ({running.State})"
- : "Off";
+ : running.Running
+ ? $"{running.Mode} ({running.State})"
+ : "Off";
var nodeMode = _settings.EnableNodeMode ? "enabled" : "disabled";
var stt = (SpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Recognition";
var tts = (TextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Synthesis";
@@ -157,6 +185,11 @@ private string BuildStatusText()
return $"Runtime: {runtime}. Node Mode is {nodeMode}. STT: {stt}. TTS: {tts}.{error}";
}
+ public void RefreshStatus()
+ {
+ StatusTextBlock.Text = BuildStatusText();
+ }
+
private void UpdateModeInfo()
{
var mode = GetSelectedMode();
@@ -256,6 +289,7 @@ private async void OnSave(object sender, RoutedEventArgs e)
{
Mode = GetSelectedMode(),
Enabled = GetSelectedMode() != VoiceActivationMode.Off,
+ ShowConversationToasts = VoiceConversationToastsCheckBox.IsChecked ?? false,
SpeechToTextProviderId = (SpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
TextToSpeechProviderId = (TextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
InputDeviceId = (InputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
@@ -277,7 +311,8 @@ private async void OnSave(object sender, RoutedEventArgs e)
MinSpeechMs = _settings.Voice.AlwaysOn.MinSpeechMs,
EndSilenceMs = _settings.Voice.AlwaysOn.EndSilenceMs,
MaxUtteranceMs = _settings.Voice.AlwaysOn.MaxUtteranceMs,
- AutoSubmit = _settings.Voice.AlwaysOn.AutoSubmit
+ AutoSubmit = _settings.Voice.AlwaysOn.AutoSubmit,
+ ChatWindowSubmitMode = GetSelectedChatWindowSubmitMode()
}
};
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index 3fd1e85..8ec7478 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -30,6 +30,7 @@ public void VoiceSettings_Defaults_AreConcreteAndProviderAgnostic()
Assert.False(settings.Enabled);
Assert.Equal(VoiceActivationMode.Off, settings.Mode);
+ Assert.False(settings.ShowConversationToasts);
Assert.Equal(VoiceProviderIds.Windows, settings.SpeechToTextProviderId);
Assert.Equal(VoiceProviderIds.Windows, settings.TextToSpeechProviderId);
Assert.Equal(16000, settings.SampleRateHz);
@@ -40,6 +41,7 @@ public void VoiceSettings_Defaults_AreConcreteAndProviderAgnostic()
Assert.Equal(0.65f, settings.WakeWord.TriggerThreshold);
Assert.Equal(250, settings.AlwaysOn.MinSpeechMs);
Assert.True(settings.AlwaysOn.AutoSubmit);
+ Assert.Equal(VoiceChatWindowSubmitMode.AutoSend, settings.AlwaysOn.ChatWindowSubmitMode);
}
[Fact]
diff --git a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
index 7533d1b..0c5ba4f 100644
--- a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
+++ b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
@@ -32,6 +32,7 @@ public void RoundTrip_AllFields_Preserved()
{
Enabled = true,
Mode = VoiceActivationMode.WakeWord,
+ ShowConversationToasts = true,
SpeechToTextProviderId = "windows",
TextToSpeechProviderId = "elevenlabs",
InputDeviceId = "mic-1",
@@ -53,7 +54,8 @@ public void RoundTrip_AllFields_Preserved()
MinSpeechMs = 300,
EndSilenceMs = 1100,
MaxUtteranceMs = 18000,
- AutoSubmit = false
+ AutoSubmit = false,
+ ChatWindowSubmitMode = VoiceChatWindowSubmitMode.WaitForUser
}
},
UserRules = new List<UserNotificationRule>
@@ -87,6 +89,7 @@ public void RoundTrip_AllFields_Preserved()
Assert.NotNull(restored.Voice);
Assert.True(restored.Voice.Enabled);
Assert.Equal(VoiceActivationMode.WakeWord, restored.Voice.Mode);
+ Assert.True(restored.Voice.ShowConversationToasts);
Assert.Equal("windows", restored.Voice.SpeechToTextProviderId);
Assert.Equal("elevenlabs", restored.Voice.TextToSpeechProviderId);
Assert.Equal("mic-1", restored.Voice.InputDeviceId);
@@ -96,6 +99,7 @@ public void RoundTrip_AllFields_Preserved()
Assert.Equal(0.72f, restored.Voice.WakeWord.TriggerThreshold);
Assert.Equal(300, restored.Voice.AlwaysOn.MinSpeechMs);
Assert.False(restored.Voice.AlwaysOn.AutoSubmit);
+ Assert.Equal(VoiceChatWindowSubmitMode.WaitForUser, restored.Voice.AlwaysOn.ChatWindowSubmitMode);
Assert.NotNull(restored.UserRules);
Assert.Single(restored.UserRules);
Assert.Equal("build.*fail", restored.UserRules[0].Pattern);
@@ -144,6 +148,7 @@ public void MissingFields_UseDefaults()
Assert.NotNull(settings.Voice);
Assert.False(settings.Voice.Enabled);
Assert.Equal(VoiceActivationMode.Off, settings.Voice.Mode);
+ Assert.False(settings.Voice.ShowConversationToasts);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.SpeechToTextProviderId);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.TextToSpeechProviderId);
Assert.Equal(16000, settings.Voice.SampleRateHz);
@@ -187,6 +192,7 @@ public void BackwardCompatibility_OldSettingsWithoutNewFields()
Assert.NotNull(settings.Voice);
Assert.False(settings.Voice.Enabled);
Assert.Equal(VoiceActivationMode.Off, settings.Voice.Mode);
+ Assert.False(settings.Voice.ShowConversationToasts);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.SpeechToTextProviderId);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.TextToSpeechProviderId);
Assert.Null(settings.UserRules);
From 197a89b7412c6b321802d504faf09466bcb46b95 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 00:26:16 +0000
Subject: [PATCH 04/83] Integrate always-on voice mode with tray chat workflow
---
docs/VOICE-MODE.md | 70 ++-
src/OpenClaw.Tray.WinUI/App.xaml.cs | 198 ++++++++-
.../Services/GlobalHotkeyService.cs | 55 ++-
.../Services/VoiceChatCoordinator.cs | 153 +++++++
.../Services/VoiceService.cs | 402 ++++++++++++++++--
.../Windows/WebChatWindow.xaml.cs | 274 ++++++++++++
6 files changed, 1106 insertions(+), 46 deletions(-)
create mode 100644 src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs
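
The routing rule this patch documents in `docs/VOICE-MODE.md` (submit through the tray chat window when it is open and ready, otherwise fall back to a direct `chat.send`) can be sketched as below. Every name here (`IChatWindow`, `TranscriptRouter`, `RouteAsync`, `FakeWindow`) is hypothetical and chosen for illustration; this is not the `VoiceChatCoordinator` added by the patch.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical abstraction over the tray chat window; not a type from the patch.
public interface IChatWindow
{
    bool IsOpenAndReady { get; }
    Task SubmitAsync(string text);
}

// Hypothetical router that picks one of the two transcript paths.
public sealed class TranscriptRouter
{
    private readonly IChatWindow? _window;
    private readonly Func<string, Task> _directChatSend;

    public TranscriptRouter(IChatWindow? window, Func<string, Task> directChatSend)
    {
        _window = window;
        _directChatSend = directChatSend;
    }

    // Returns the path taken ("window" or "direct") so callers can log it.
    public async Task<string> RouteAsync(string transcript)
    {
        if (_window is { IsOpenAndReady: true })
        {
            await _window.SubmitAsync(transcript);
            return "window";
        }

        await _directChatSend(transcript);
        return "direct";
    }
}

// Minimal fake used only by the demo below.
public sealed class FakeWindow : IChatWindow
{
    public bool IsOpenAndReady { get; set; }
    public Task SubmitAsync(string text) => Task.CompletedTask;
}

public static class RouterDemo
{
    public static async Task Main()
    {
        var window = new FakeWindow { IsOpenAndReady = true };
        var router = new TranscriptRouter(window, _ => Task.CompletedTask);

        Console.WriteLine(await router.RouteAsync("hello")); // window

        window.IsOpenAndReady = false;
        Console.WriteLine(await router.RouteAsync("hello")); // direct
    }
}
```

Keeping the decision in one place means the voice runtime does not need to know whether a window exists. It only hands over the final transcript.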
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 664be7f..5347acd 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -51,13 +51,75 @@ For macOS parity, `AlwaysOn` should follow Talk Mode's documented control flow:
- the node captures audio locally
- local speech recognition turns that audio into transcript text
-- the transcript is sent to OpenClaw via `chat.send` on the main session
+- if the tray chat window is open and ready, the final transcript is submitted through the tray chat window's own compose/send path
+- otherwise, the transcript is sent to OpenClaw via direct `chat.send` on the main session
- OpenClaw returns the assistant reply as normal chat output
- the node performs local TTS playback of that reply
That means the first Windows parity target is transcript transport, not raw audio upload. Streaming audio frames in or out of OpenClaw remains a future protocol extension, not part of this design.
-The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection exists only to carry `chat.send` and assistant chat events for `AlwaysOn`.
+The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection exists to carry assistant chat events for `AlwaysOn`, and to provide a fallback direct `chat.send` path when the tray chat window is not open.
+
+## Tray Chat Integration Decision
+
+Voice mode and typed chat must remain part of the same user-visible conversation in the tray app. Creating a separate "voice session" would reduce implementation complexity, but it would make the chat experience harder to understand:
+
+- voice utterances would not appear in the same tray chat history as typed messages
+- the user would need to reason about two concurrent sessions for one tray app
+- voice replies and typed replies could diverge across windows
+
+### Problem Encountered
+
+When `AlwaysOn` sends transcript text to the main OpenClaw session, the upstream session can include scaffolding such as `<relevant-memories>...</relevant-memories>` in the rendered user message body shown in the tray chat window.
+
+Sending transcripts straight to the main session therefore had two UX problems:
+
+- the tray chat bubble did not show the clean spoken transcript the user actually said
+- the embedded tray chat window had no draft/update API for showing interim STT hypotheses while the user was still speaking
+
+### Routes Examined
+
+1. Dedicated voice session
+ - technically clean from a transport perspective
+ - rejected because it fragments the tray chat experience and is confusing for users
+2. Upstream OpenClaw change to suppress memory scaffolding for voice turns
+ - desirable long-term if OpenClaw exposes a first-class voice-aware chat surface
+ - rejected for the current phase because this Windows tray feature must work without waiting for upstream protocol/UI changes
+3. Tray-local DOM mediation in the embedded chat window
+ - chosen
+ - keeps a single session and single tray chat history
+ - allows interim hypotheses to appear in the tray compose box in near real time
+ - allows the tray app to submit through the same UI path as typed messages when the tray chat window is open
+4. Hybrid submission path
+ - chosen
+ - when the tray chat window is open, voice submits through the chat window DOM send path
+ - when the tray chat window is closed or unavailable, voice falls back to direct `chat.send`
+ - preserves windowless voice mode without forcing the transport layer to depend on WebView availability
+
+### Chosen Approach
+
+The tray app keeps a tray-local interim transcript buffer for the current utterance, independent of whether the chat window is open.
+
+The embedded [WebChatWindow.xaml.cs](C:/dev/openclaw-windows-node/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs) owns the tray-local chat integration layer:
+
+- interim STT hypotheses from Windows speech recognition are injected into the tray chat compose box while the user is speaking
+- if the chat window opens during an utterance, the current buffered transcript is copied into the compose box immediately
+- if the chat window closes during an utterance, voice continues windowless and the final utterance still submits
+- if the chat window is open and ready when the utterance finalizes, the tray app either auto-submits through the page's own send path or leaves the draft for manual send, depending on `Voice.AlwaysOn.ChatWindowSubmitMode`
+- in `WaitForUser` mode, voice capture pauses after finalizing the draft so the next utterance does not overwrite the unsent message
+- if the chat window is not open or not ready, the voice service falls back to direct `chat.send`
+- rendered chat content inside the tray window is still sanitized to remove `<relevant-memories>...</relevant-memories>` blocks as a fallback for messages that were sent while windowless
+
+This is intentionally a tray-local integration decision, not a protocol-level rewrite of the stored upstream transcript.
+
+### Tradeoffs
+
+- preserves a single visible conversation for the user
+- avoids a second voice-only session in the tray UI
+- when the tray chat window is open, voice follows the same send path as typed tray-chat messages
+- depends on DOM integration inside the embedded WebView chat surface because OpenClaw does not currently expose a dedicated draft/update or voice-submit API for the tray app
+- still requires a direct fallback path for windowless voice mode
+- only affects the tray app chat window; other clients still render upstream content according to their own rules
## Provider Selection
@@ -191,7 +253,8 @@ Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](C:/dev
"MinSpeechMs": 250,
"EndSilenceMs": 900,
"MaxUtteranceMs": 15000,
- "AutoSubmit": true
+ "AutoSubmit": true,
+ "ChatWindowSubmitMode": "AutoSend"
}
}
}
@@ -235,6 +298,7 @@ Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](C:/dev
| `Voice.AlwaysOn.EndSilenceMs` | int | `900` | always-on | Silence timeout used to finalize an utterance |
| `Voice.AlwaysOn.MaxUtteranceMs` | int | `15000` | always-on | Hard cap on utterance length before forced submission/finalization |
| `Voice.AlwaysOn.AutoSubmit` | bool | `true` | always-on | If `true`, completed utterances are submitted immediately without extra confirmation |
+| `Voice.AlwaysOn.ChatWindowSubmitMode` | enum | `AutoSend` | always-on | When the tray chat window is open, either auto-send the finalized utterance or leave it in the compose box for manual send |
At runtime today, those device ids are persisted and surfaced in the UI, but the v1 `AlwaysOn` path still uses the Windows system speech stack defaults for capture and playback.
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index 37552b9..ab6feaa 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -74,6 +74,7 @@ public partial class App : Application
// Node service (optional, enabled in settings)
private NodeService? _nodeService;
private VoiceService? _voiceService;
+ private VoiceChatCoordinator? _voiceChatCoordinator;
// Keep-alive window to anchor WinUI runtime (prevents GC/threading issues)
private Window? _keepAliveWindow;
@@ -253,6 +254,8 @@ protected override async void OnLaunched(LaunchActivatedEventArgs args)
// Initialize settings
_settings = new SettingsManager();
_voiceService = new VoiceService(new AppLogger(), _settings);
+ _voiceChatCoordinator = new VoiceChatCoordinator(_voiceService, _settings, _dispatcherQueue!);
+ _voiceChatCoordinator.ConversationTurnAvailable += OnVoiceConversationTurnAvailable;
// First-run check
if (string.IsNullOrWhiteSpace(_settings.Token))
@@ -287,7 +290,8 @@ protected override async void OnLaunched(LaunchActivatedEventArgs args)
if (_settings.GlobalHotkeyEnabled)
{
_globalHotkey = new GlobalHotkeyService();
- _globalHotkey.HotkeyPressed += OnGlobalHotkeyPressed;
+ _globalHotkey.QuickSendHotkeyPressed += OnGlobalQuickSendHotkeyPressed;
+ _globalHotkey.VoiceToggleHotkeyPressed += OnGlobalVoiceToggleHotkeyPressed;
_globalHotkey.Register();
}
@@ -518,6 +522,7 @@ private void OnTrayMenuItemClicked(object? sender, string action)
{
case "status": ShowStatusDetail(); break;
case "voice-settings": ShowVoiceModeSettings(); break;
+ case "voice-toggle-pause": _ = ToggleVoiceQuickPauseAsync(); break;
case "dashboard": OpenDashboard(); break;
case "webchat": ShowWebChat(); break;
case "quicksend": ShowQuickSend(); break;
@@ -732,7 +737,17 @@ private List<string> GetRecentActivity(int maxItems)
private string GetRunningVoiceModeLabel()
{
var status = _voiceService?.CurrentStatus;
- if (status?.Running == true)
+ if (status == null)
+ {
+ return "Off";
+ }
+
+ if (status.State == VoiceRuntimeState.Paused)
+ {
+ return $"{status.Mode} (Paused)";
+ }
+
+ if (status.Running)
{
return status.Mode switch
{
@@ -745,6 +760,30 @@ private string GetRunningVoiceModeLabel()
return "Off";
}
+ private bool CanQuickToggleVoiceMode()
+ {
+ if (_settings?.EnableNodeMode != true || _voiceService == null)
+ {
+ return false;
+ }
+
+ var status = _voiceService.CurrentStatus;
+ if (status.State == VoiceRuntimeState.Paused)
+ {
+ return true;
+ }
+
+ return _settings.Voice.Enabled && _settings.Voice.Mode != VoiceActivationMode.Off;
+ }
+
+ private string GetVoiceQuickToggleLabel()
+ {
+ var status = _voiceService?.CurrentStatus;
+ return status?.State == VoiceRuntimeState.Paused
+ ? "Resume Voice"
+ : "Pause Voice";
+ }
+
private string GetVoiceDeviceSummary()
{
var voice = _settings?.Voice;
@@ -774,6 +813,7 @@ private void BuildTrayMenuPopup(TrayMenuWindow menu)
menu.AddMenuItem($"Voice Mode: {GetRunningVoiceModeLabel()}", "🎙️", "voice-settings");
menu.AddMenuItem($"↳ {GetVoiceDeviceSummary()}", "", "", isEnabled: false, indent: true);
+ menu.AddMenuItem($"↳ {GetVoiceQuickToggleLabel()} (Ctrl+Alt+Shift+V)", "", "voice-toggle-pause", isEnabled: CanQuickToggleVoiceMode(), indent: true);
if (_settings?.EnableNodeMode != true)
{
menu.AddMenuItem("↳ Enable Node Mode to activate voice runtime", "", "", isEnabled: false, indent: true);
@@ -1676,8 +1716,10 @@ private void OnSettingsSaved(object? sender, EventArgs e)
if (_settings!.GlobalHotkeyEnabled)
{
_globalHotkey ??= new GlobalHotkeyService();
- _globalHotkey.HotkeyPressed -= OnGlobalHotkeyPressed;
- _globalHotkey.HotkeyPressed += OnGlobalHotkeyPressed;
+ _globalHotkey.QuickSendHotkeyPressed -= OnGlobalQuickSendHotkeyPressed;
+ _globalHotkey.QuickSendHotkeyPressed += OnGlobalQuickSendHotkeyPressed;
+ _globalHotkey.VoiceToggleHotkeyPressed -= OnGlobalVoiceToggleHotkeyPressed;
+ _globalHotkey.VoiceToggleHotkeyPressed += OnGlobalVoiceToggleHotkeyPressed;
_globalHotkey.Register();
}
else
@@ -1694,7 +1736,12 @@ private void ShowWebChat()
if (_webChatWindow == null || _webChatWindow.IsClosed)
{
_webChatWindow = new WebChatWindow(_settings!.GatewayUrl, _settings.Token);
- _webChatWindow.Closed += (s, e) => _webChatWindow = null;
+ _webChatWindow.Closed += (s, e) =>
+ {
+ _voiceChatCoordinator?.DetachWindow(_webChatWindow);
+ _webChatWindow = null;
+ };
+ _voiceChatCoordinator?.AttachWindow(_webChatWindow);
}
_webChatWindow.Activate();
}
@@ -1897,7 +1944,7 @@ private void OpenLogFile()
}
}
- private void OnGlobalHotkeyPressed(object? sender, EventArgs e)
+ private void OnGlobalQuickSendHotkeyPressed(object? sender, EventArgs e)
{
// Hotkey events are raised from a dedicated Win32 message-loop thread.
// Creating/activating WinUI windows must happen on the app's UI thread.
@@ -1914,6 +1961,136 @@ private void OnGlobalHotkeyPressed(object? sender, EventArgs e)
}
}
+ private void OnGlobalVoiceToggleHotkeyPressed(object? sender, EventArgs e)
+ {
+ if (_dispatcherQueue == null)
+ {
+ Logger.Warn("Voice hotkey pressed but DispatcherQueue is null");
+ return;
+ }
+
+ var enqueued = _dispatcherQueue.TryEnqueue(async () => await ToggleVoiceQuickPauseAsync());
+ if (!enqueued)
+ {
+ Logger.Warn("Voice hotkey pressed but failed to enqueue Voice quick pause on UI thread");
+ }
+ }
+
+ private async Task ToggleVoiceQuickPauseAsync()
+ {
+ if (_voiceService == null)
+ {
+ return;
+ }
+
+ if (_settings?.EnableNodeMode != true)
+ {
+ Logger.Warn("Voice quick pause blocked: Node Mode is disabled");
+ return;
+ }
+
+ if (!CanQuickToggleVoiceMode())
+ {
+ Logger.Warn("Voice quick pause blocked: Voice Mode is off");
+ return;
+ }
+
+ try
+ {
+ var status = await _voiceService.ToggleQuickPauseAsync();
+ _voiceModeWindow?.RefreshStatus();
+ ShowVoiceQuickToggleToast(status);
+ }
+ catch (Exception ex)
+ {
+ Logger.Warn($"Voice quick pause failed: {ex.Message}");
+ }
+ }
+
+ private static void ShowVoiceQuickToggleToast(VoiceStatusInfo status)
+ {
+ try
+ {
+ var title = status.State == VoiceRuntimeState.Paused
+ ? "Voice paused"
+ : "Voice resumed";
+ var detail = status.State == VoiceRuntimeState.Paused
+ ? $"{status.Mode} is paused. Press Ctrl+Alt+Shift+V to resume."
+ : $"{status.Mode} is active again.";
+
+ new ToastContentBuilder()
+ .AddText(title)
+ .AddText(detail)
+ .Show();
+ }
+ catch (Exception ex)
+ {
+ Logger.Warn($"Failed to show voice pause toast: {ex.Message}");
+ }
+ }
+
+ private void OnVoiceConversationTurnAvailable(object? sender, VoiceConversationTurnEventArgs args)
+ {
+ if (_dispatcherQueue == null)
+ {
+ return;
+ }
+
+ _dispatcherQueue.TryEnqueue(() => ShowVoiceConversationToast(args));
+ }
+
+ private void ShowVoiceConversationToast(VoiceConversationTurnEventArgs args)
+ {
+ if (_settings?.Voice.ShowConversationToasts != true)
+ {
+ return;
+ }
+
+ var title = args.Direction == VoiceConversationDirection.Outgoing
+ ? "Voice heard"
+ : "Voice reply";
+
+ AddRecentActivity(
+ $"voice: {title}",
+ category: "voice",
+ details: args.Message,
+ dashboardPath: "chat",
+ sessionKey: args.SessionKey);
+
+ NotificationHistoryService.AddNotification(new Services.GatewayNotification
+ {
+ Title = title,
+ Message = args.Message,
+ Category = "voice"
+ });
+
+ if (_settings.ShowNotifications != true)
+ {
+ return;
+ }
+
+ try
+ {
+ var builder = new ToastContentBuilder()
+ .AddText(title)
+ .AddText(args.Message);
+
+ if (args.Direction == VoiceConversationDirection.Incoming)
+ {
+ builder.AddArgument("action", "open_chat")
+ .AddButton(new ToastButton()
+ .SetContent("Open Chat")
+ .AddArgument("action", "open_chat"));
+ }
+
+ builder.Show();
+ }
+ catch (Exception ex)
+ {
+ Logger.Warn($"Failed to show voice conversation toast: {ex.Message}");
+ }
+ }
+
#endregion
#region Updates
@@ -2123,6 +2300,15 @@ private void ExitApplication()
// Dispose cancellation token source
_deepLinkCts?.Dispose();
+ if (_voiceChatCoordinator != null)
+ {
+ _voiceChatCoordinator.ConversationTurnAvailable -= OnVoiceConversationTurnAvailable;
+ _voiceChatCoordinator.Dispose();
+ }
+ if (_voiceService != null)
+ {
+ _voiceService.TranscriptSubmitter = null;
+ }
_voiceService?.Dispose();
Exit();
diff --git a/src/OpenClaw.Tray.WinUI/Services/GlobalHotkeyService.cs b/src/OpenClaw.Tray.WinUI/Services/GlobalHotkeyService.cs
index d0e5f93..92496d5 100644
--- a/src/OpenClaw.Tray.WinUI/Services/GlobalHotkeyService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/GlobalHotkeyService.cs
@@ -7,15 +7,19 @@ namespace OpenClawTray.Services;
/// <summary>
/// Registers and handles global hotkeys using P/Invoke.
-/// Default: Ctrl+Alt+Shift+C for Quick Send.
+/// Defaults:
+/// - Ctrl+Alt+Shift+C for Quick Send
+/// - Ctrl+Alt+Shift+V for Voice pause/resume
/// </summary>
public class GlobalHotkeyService : IDisposable
{
- private const int HOTKEY_ID = 9001;
+ private const int QUICK_SEND_HOTKEY_ID = 9001;
+ private const int VOICE_TOGGLE_HOTKEY_ID = 9002;
private const uint MOD_CONTROL = 0x0002;
private const uint MOD_ALT = 0x0001;
private const uint MOD_SHIFT = 0x0004;
private const uint VK_C = 0x43;
+ private const uint VK_V = 0x56;
private const int WM_HOTKEY = 0x0312;
[DllImport("user32.dll", SetLastError = true)]
@@ -113,7 +117,8 @@ private struct POINT
private readonly ManualResetEventSlim _windowReady = new(false);
private readonly ManualResetEventSlim _opCompleted = new(false);
- public event EventHandler? HotkeyPressed;
+ public event EventHandler? QuickSendHotkeyPressed;
+ public event EventHandler? VoiceToggleHotkeyPressed;
public GlobalHotkeyService()
{
@@ -225,19 +230,34 @@ private IntPtr WndProc(IntPtr hWnd, uint msg, IntPtr wParam, IntPtr lParam)
if (msg == WM_APP_REGISTER)
{
// Register from the message-loop thread that owns hWnd.
- _registered = RegisterHotKey(hWnd, HOTKEY_ID,
+ var quickSendRegistered = RegisterHotKey(hWnd, QUICK_SEND_HOTKEY_ID,
MOD_CONTROL | MOD_ALT | MOD_SHIFT | MOD_NOREPEAT,
VK_C);
+ var voiceToggleRegistered = RegisterHotKey(hWnd, VOICE_TOGGLE_HOTKEY_ID,
+ MOD_CONTROL | MOD_ALT | MOD_SHIFT | MOD_NOREPEAT,
+ VK_V);
+
+ _registered = quickSendRegistered && voiceToggleRegistered;
if (_registered)
{
- Logger.Info("Global hotkey registered: Ctrl+Alt+Shift+C");
+ Logger.Info("Global hotkeys registered: Ctrl+Alt+Shift+C (Quick Send), Ctrl+Alt+Shift+V (Voice Pause)");
}
else
{
+ if (quickSendRegistered)
+ {
+ UnregisterHotKey(hWnd, QUICK_SEND_HOTKEY_ID);
+ }
+
+ if (voiceToggleRegistered)
+ {
+ UnregisterHotKey(hWnd, VOICE_TOGGLE_HOTKEY_ID);
+ }
+
var err = Marshal.GetLastWin32Error();
var errMsg = new Win32Exception(err).Message;
- Logger.Warn($"Failed to register global hotkey (Win32Error={err}: {errMsg})");
+ Logger.Warn($"Failed to register one or more global hotkeys (Win32Error={err}: {errMsg})");
}
_opCompleted.Set();
@@ -250,9 +270,10 @@ private IntPtr WndProc(IntPtr hWnd, uint msg, IntPtr wParam, IntPtr lParam)
{
if (_registered)
{
- UnregisterHotKey(hWnd, HOTKEY_ID);
+ UnregisterHotKey(hWnd, QUICK_SEND_HOTKEY_ID);
+ UnregisterHotKey(hWnd, VOICE_TOGGLE_HOTKEY_ID);
_registered = false;
- Logger.Info("Global hotkey unregistered");
+ Logger.Info("Global hotkeys unregistered");
}
}
catch (Exception ex)
@@ -266,10 +287,15 @@ private IntPtr WndProc(IntPtr hWnd, uint msg, IntPtr wParam, IntPtr lParam)
return IntPtr.Zero;
}
- if (msg == WM_HOTKEY && wParam.ToInt32() == HOTKEY_ID)
+ if (msg == WM_HOTKEY && wParam.ToInt32() == QUICK_SEND_HOTKEY_ID)
{
Logger.Info("Hotkey pressed: Ctrl+Alt+Shift+C");
- OnHotkeyPressed();
+ OnQuickSendHotkeyPressed();
+ }
+ else if (msg == WM_HOTKEY && wParam.ToInt32() == VOICE_TOGGLE_HOTKEY_ID)
+ {
+ Logger.Info("Hotkey pressed: Ctrl+Alt+Shift+V");
+ OnVoiceToggleHotkeyPressed();
}
return DefWindowProc(hWnd, msg, wParam, lParam);
}
@@ -302,9 +328,14 @@ public void Unregister()
}
}
- internal void OnHotkeyPressed()
+ internal void OnQuickSendHotkeyPressed()
+ {
+ QuickSendHotkeyPressed?.Invoke(this, EventArgs.Empty);
+ }
+
+ internal void OnVoiceToggleHotkeyPressed()
{
- HotkeyPressed?.Invoke(this, EventArgs.Empty);
+ VoiceToggleHotkeyPressed?.Invoke(this, EventArgs.Empty);
}
public void Dispose()
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs
new file mode 100644
index 0000000..eb86180
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs
@@ -0,0 +1,153 @@
+using Microsoft.UI.Dispatching;
+using OpenClaw.Shared;
+using OpenClawTray.Windows;
+using System;
+using System.Threading.Tasks;
+
+namespace OpenClawTray.Services;
+
+public sealed class VoiceChatCoordinator : IDisposable
+{
+ private readonly VoiceService _voiceService;
+ private readonly SettingsManager _settings;
+ private readonly DispatcherQueue _dispatcherQueue;
+ private readonly object _gate = new();
+
+ private WebChatWindow? _webChatWindow;
+ private string _voiceTranscriptDraftText = string.Empty;
+ private bool _disposed;
+
+ public event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
+
+ public VoiceChatCoordinator(
+ VoiceService voiceService,
+ SettingsManager settings,
+ DispatcherQueue dispatcherQueue)
+ {
+ _voiceService = voiceService;
+ _settings = settings;
+ _dispatcherQueue = dispatcherQueue;
+
+ _voiceService.ConversationTurnAvailable += OnVoiceConversationTurnAvailable;
+ _voiceService.TranscriptDraftUpdated += OnVoiceTranscriptDraftUpdated;
+ _voiceService.TranscriptSubmitter = SubmitVoiceTranscriptAsync;
+ }
+
+ public void AttachWindow(WebChatWindow window)
+ {
+ ArgumentNullException.ThrowIfNull(window);
+
+ lock (_gate)
+ {
+ if (ReferenceEquals(_webChatWindow, window))
+ {
+ return;
+ }
+
+ if (_webChatWindow != null)
+ {
+ _webChatWindow.VoiceTranscriptSubmitted -= OnWebChatVoiceTranscriptSubmitted;
+ }
+
+ _webChatWindow = window;
+ _webChatWindow.VoiceTranscriptSubmitted += OnWebChatVoiceTranscriptSubmitted;
+ }
+
+ _ = window.UpdateVoiceTranscriptDraftAsync(
+ _voiceTranscriptDraftText,
+ clear: string.IsNullOrWhiteSpace(_voiceTranscriptDraftText));
+ }
+
+ public void DetachWindow(WebChatWindow? window)
+ {
+ lock (_gate)
+ {
+ if (_webChatWindow == null)
+ {
+ return;
+ }
+
+ if (window != null && !ReferenceEquals(_webChatWindow, window))
+ {
+ return;
+ }
+
+ _webChatWindow.VoiceTranscriptSubmitted -= OnWebChatVoiceTranscriptSubmitted;
+ _webChatWindow = null;
+ }
+ }
+
+ public void Dispose()
+ {
+ if (_disposed)
+ {
+ return;
+ }
+
+ _disposed = true;
+ DetachWindow(null);
+ _voiceService.ConversationTurnAvailable -= OnVoiceConversationTurnAvailable;
+ _voiceService.TranscriptDraftUpdated -= OnVoiceTranscriptDraftUpdated;
+ _voiceService.TranscriptSubmitter = null;
+ }
+
+ private void OnVoiceConversationTurnAvailable(object? sender, VoiceConversationTurnEventArgs args)
+ {
+ _dispatcherQueue.TryEnqueue(() =>
+ {
+ ConversationTurnAvailable?.Invoke(this, args);
+ });
+ }
+
+ private void OnVoiceTranscriptDraftUpdated(object? sender, VoiceTranscriptDraftEventArgs args)
+ {
+ _dispatcherQueue.TryEnqueue(() =>
+ {
+ _voiceTranscriptDraftText = args.Clear ? string.Empty : (args.Text ?? string.Empty);
+
+ WebChatWindow? window;
+ lock (_gate)
+ {
+ window = _webChatWindow;
+ }
+
+ if (window == null || window.IsClosed)
+ {
+ return;
+ }
+
+ _ = window.UpdateVoiceTranscriptDraftAsync(_voiceTranscriptDraftText, args.Clear);
+ });
+ }
+
+ private void OnWebChatVoiceTranscriptSubmitted(object? sender, VoiceTranscriptSubmittedEventArgs args)
+ {
+ _voiceTranscriptDraftText = string.Empty;
+ _voiceService.NotifyManualTranscriptSubmitted(args.Text, args.SessionKey);
+ }
+
+ private async Task<VoiceTranscriptSubmitOutcome> SubmitVoiceTranscriptAsync(string text, string? sessionKey)
+ {
+ WebChatWindow? window;
+ lock (_gate)
+ {
+ window = _webChatWindow;
+ }
+
+ if (window == null || window.IsClosed)
+ {
+ return VoiceTranscriptSubmitOutcome.Unavailable;
+ }
+
+ if (_settings.Voice.AlwaysOn.ChatWindowSubmitMode == VoiceChatWindowSubmitMode.WaitForUser)
+ {
+ return await window.PrepareVoiceTranscriptForManualSendAsync(text)
+ ? VoiceTranscriptSubmitOutcome.DeferredToUser
+ : VoiceTranscriptSubmitOutcome.Unavailable;
+ }
+
+ return await window.TrySubmitVoiceTranscriptAsync(text)
+ ? VoiceTranscriptSubmitOutcome.Submitted
+ : VoiceTranscriptSubmitOutcome.Unavailable;
+ }
+}
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
index 3d3e982..5eb1367 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
@@ -1,6 +1,7 @@
using System;
using System.Collections.Generic;
using System.Linq;
+using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
using OpenClaw.Shared;
@@ -39,10 +40,16 @@ public sealed class VoiceService : IDisposable
private bool _recognitionActive;
private bool _awaitingReply;
private bool _isSpeaking;
+ private bool _quickPaused;
private string? _lastTranscript;
private DateTime _lastTranscriptUtc;
+ private string? _pendingManualTranscript;
private bool _disposed;
+ public event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
+ public event EventHandler<VoiceTranscriptDraftEventArgs>? TranscriptDraftUpdated;
+ public Func<string, string?, Task<VoiceTranscriptSubmitOutcome>>? TranscriptSubmitter { get; set; }
+
public VoiceService(IOpenClawLogger logger, SettingsManager settings)
{
_logger = logger;
@@ -82,7 +89,19 @@ public Task<VoiceSettings> UpdateSettingsAsync(VoiceSettingsUpdateArgs update)
_settings.Save();
}
- if (_status.Running)
+ if (!_settings.Voice.Enabled || _settings.Voice.Mode == VoiceActivationMode.Off)
+ {
+ _quickPaused = false;
+ _status = BuildStoppedStatus(_status.SessionKey, _status.LastError);
+ }
+ else if (_quickPaused || _status.State == VoiceRuntimeState.Paused)
+ {
+ _status = BuildPausedStatus(
+ _runtimeModeOverride ?? _settings.Voice.Mode,
+ _status.SessionKey,
+ _status.LastError);
+ }
+ else if (_status.Running)
{
_status = BuildRunningStatus(
_runtimeModeOverride ?? _settings.Voice.Mode,
@@ -109,6 +128,59 @@ public Task<VoiceStatusInfo> GetStatusAsync()
}
}
+ public async Task<VoiceStatusInfo> ToggleQuickPauseAsync()
+ {
+ ObjectDisposedException.ThrowIf(_disposed, this);
+
+ VoiceActivationMode mode;
+ string? sessionKey;
+ bool shouldResume;
+
+ lock (_gate)
+ {
+ mode = _runtimeModeOverride ?? _settings.Voice.Mode;
+ sessionKey = _status.SessionKey;
+
+ if (!_settings.Voice.Enabled || mode == VoiceActivationMode.Off)
+ {
+ _quickPaused = false;
+ _status = BuildStoppedStatus(sessionKey, "Voice mode is disabled");
+ return Clone(_status);
+ }
+
+ shouldResume = _quickPaused || _status.State == VoiceRuntimeState.Paused;
+ if (!shouldResume)
+ {
+ _quickPaused = true;
+ }
+ }
+
+ if (shouldResume)
+ {
+ lock (_gate)
+ {
+ _quickPaused = false;
+ }
+
+ var resumed = await StartAsync(new VoiceStartArgs
+ {
+ Mode = mode,
+ SessionKey = sessionKey
+ });
+ _logger.Info($"Voice runtime resumed via quick toggle ({mode})");
+ return resumed;
+ }
+
+ await StopRuntimeResourcesAsync(updateStoppedStatus: false);
+
+ lock (_gate)
+ {
+ _status = BuildPausedStatus(mode, sessionKey, null);
+ _logger.Info($"Voice runtime paused via quick toggle ({mode})");
+ return Clone(_status);
+ }
+ }
+
public async Task<VoiceStatusInfo> StartAsync(VoiceStartArgs args)
{
ObjectDisposedException.ThrowIf(_disposed, this);
@@ -138,9 +210,16 @@ public async Task<VoiceStatusInfo> StartAsync(VoiceStartArgs args)
if (!effectiveSettings.Enabled || requestedMode == VoiceActivationMode.Off)
{
+ _quickPaused = false;
_status = BuildStoppedStatus(sessionKey, "Voice mode is disabled");
return Clone(_status);
}
+
+ if (_quickPaused)
+ {
+ _status = BuildPausedStatus(requestedMode, sessionKey, _status.LastError);
+ return Clone(_status);
+ }
}
await StopRuntimeResourcesAsync(updateStoppedStatus: false);
@@ -191,6 +270,7 @@ public async Task<VoiceStatusInfo> StopAsync(VoiceStopArgs args)
lock (_gate)
{
+ _quickPaused = false;
_runtimeModeOverride = null;
_status = BuildStoppedStatus(_status.SessionKey, args.Reason);
_logger.Info($"Voice runtime stopped{(string.IsNullOrWhiteSpace(args.Reason) ? string.Empty : $": {args.Reason}")}");
@@ -277,6 +357,7 @@ public void Dispose()
private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? sessionKey)
{
+ var effectiveSessionKey = string.IsNullOrWhiteSpace(sessionKey) ? "main" : sessionKey;
var selectedSpeechToText = VoiceProviderCatalogService.ResolveSpeechToTextProvider(
settings.SpeechToTextProviderId,
_logger);
@@ -302,6 +383,7 @@ private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? ses
_logger.Warn("Selected output device is saved, but AlwaysOn currently uses the default speech output device.");
}
+ recognizer.HypothesisGenerated += OnSpeechHypothesisGenerated;
recognizer.ContinuousRecognitionSession.ResultGenerated += OnSpeechResultGenerated;
recognizer.ContinuousRecognitionSession.Completed += OnSpeechRecognitionCompleted;
@@ -313,7 +395,7 @@ private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? ses
_mediaPlayer = player;
_status = BuildRunningStatus(
VoiceActivationMode.AlwaysOn,
- sessionKey,
+ effectiveSessionKey,
VoiceRuntimeState.Arming,
fallbackMessage);
}
@@ -327,7 +409,7 @@ private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? ses
{
_status = BuildRunningStatus(
VoiceActivationMode.AlwaysOn,
- sessionKey,
+ effectiveSessionKey,
VoiceRuntimeState.ListeningContinuously,
fallbackMessage);
}
@@ -392,7 +474,7 @@ private async Task EnsureChatTransportAsync(CancellationToken cancellationToken)
{
_chatClient = new OpenClawGatewayClient(_settings.GatewayUrl, _settings.Token, _logger);
_chatClient.StatusChanged += OnChatTransportStatusChanged;
- _chatClient.NotificationReceived += OnChatNotificationReceived;
+ _chatClient.ChatMessageReceived += OnChatMessageReceived;
existingClient = _chatClient;
_chatTransportStatus = ConnectionStatus.Connecting;
}
@@ -510,12 +592,41 @@ private async void OnSpeechResultGenerated(
}
}
- private async Task HandleRecognizedTextAsync(string text)
+ private void OnSpeechHypothesisGenerated(SpeechRecognizer sender, SpeechRecognitionHypothesisGeneratedEventArgs args)
{
- CancellationToken cancellationToken;
+ string? sessionKey = null;
+ string? text = null;
lock (_gate)
{
+ if (_runtimeCts == null ||
+ _status.Mode != VoiceActivationMode.AlwaysOn ||
+ !_status.Running ||
+ _awaitingReply ||
+ _isSpeaking)
+ {
+ return;
+ }
+
+ text = args.Hypothesis?.Text?.Trim();
+ sessionKey = GetCurrentVoiceSessionKey();
+ }
+
+ if (string.IsNullOrWhiteSpace(text))
+ {
+ return;
+ }
+
+ RaiseTranscriptDraft(text, sessionKey, clear: false);
+ }
+
+ private async Task HandleRecognizedTextAsync(string text)
+ {
+ CancellationToken cancellationToken;
+ string sessionKey;
+
+ lock (_gate)
+ {
if (_runtimeCts == null || _status.Mode != VoiceActivationMode.AlwaysOn || !_status.Running)
{
return;
@@ -534,15 +645,11 @@ private async Task HandleRecognizedTextAsync(string text)
_lastTranscript = text;
_lastTranscriptUtc = DateTime.UtcNow;
- _awaitingReply = true;
- _status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
- _status.SessionKey,
- VoiceRuntimeState.AwaitingResponse,
- _status.LastError);
- _status.LastUtteranceUtc = DateTime.UtcNow;
cancellationToken = _runtimeCts.Token;
- }
+ sessionKey = GetCurrentVoiceSessionKey();
+ }
+
+ RaiseTranscriptDraft(text, sessionKey, clear: false);
await StopRecognitionSessionAsync();
@@ -551,9 +658,11 @@ private async Task HandleRecognizedTextAsync(string text)
await EnsureChatTransportAsync(cancellationToken);
OpenClawGatewayClient? client;
+ Func<string, string?, Task<VoiceTranscriptSubmitOutcome>>? transcriptSubmitter;
lock (_gate)
{
client = _chatClient;
+ transcriptSubmitter = TranscriptSubmitter;
}
if (client == null)
@@ -562,7 +671,50 @@ private async Task HandleRecognizedTextAsync(string text)
}
_logger.Info($"Voice transcript captured: {text}");
- await client.SendChatMessageAsync(text);
+ var submitOutcome = VoiceTranscriptSubmitOutcome.Unavailable;
+ if (transcriptSubmitter != null)
+ {
+ submitOutcome = await transcriptSubmitter(text, sessionKey);
+ }
+
+ if (submitOutcome == VoiceTranscriptSubmitOutcome.Unavailable)
+ {
+ await client.SendChatMessageAsync(text, sessionKey);
+ submitOutcome = VoiceTranscriptSubmitOutcome.Submitted;
+ }
+
+ if (submitOutcome == VoiceTranscriptSubmitOutcome.DeferredToUser)
+ {
+ lock (_gate)
+ {
+ _awaitingReply = false;
+ _pendingManualTranscript = text;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.PendingManualSend,
+ "Draft ready in tray chat window. Send it manually to continue.");
+ _status.LastUtteranceUtc = DateTime.UtcNow;
+ }
+
+ RaiseTranscriptDraft(text, sessionKey, clear: false);
+ return;
+ }
+
+ lock (_gate)
+ {
+ _awaitingReply = true;
+ _pendingManualTranscript = null;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.AwaitingResponse,
+ _status.LastError);
+ _status.LastUtteranceUtc = DateTime.UtcNow;
+ }
+
+ RaiseConversationTurn(VoiceConversationDirection.Outgoing, text, sessionKey);
+ RaiseTranscriptDraft(string.Empty, sessionKey, clear: true);
_ = MonitorReplyTimeoutAsync(text, cancellationToken);
}
catch (Exception ex)
@@ -583,6 +735,35 @@ private async Task HandleRecognizedTextAsync(string text)
}
}
+ public void NotifyManualTranscriptSubmitted(string text, string? sessionKey = null)
+ {
+ CancellationToken cancellationToken;
+ string effectiveSessionKey;
+
+ lock (_gate)
+ {
+ if (_runtimeCts == null || _status.Mode != VoiceActivationMode.AlwaysOn || !_status.Running)
+ {
+ return;
+ }
+
+ cancellationToken = _runtimeCts.Token;
+ effectiveSessionKey = string.IsNullOrWhiteSpace(sessionKey) ? GetCurrentVoiceSessionKey() : sessionKey!;
+ _pendingManualTranscript = null;
+ _awaitingReply = true;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.AwaitingResponse,
+ _status.LastError);
+ _status.LastUtteranceUtc = DateTime.UtcNow;
+ }
+
+ RaiseConversationTurn(VoiceConversationDirection.Outgoing, text, effectiveSessionKey);
+ RaiseTranscriptDraft(string.Empty, effectiveSessionKey, clear: true);
+ _ = MonitorReplyTimeoutAsync(text, cancellationToken);
+ }
+
private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken cancellationToken)
{
try
@@ -616,9 +797,11 @@ private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken
}
}
- private async void OnChatNotificationReceived(object? sender, OpenClawNotification notification)
+ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs args)
{
- if (!notification.IsChat || string.IsNullOrWhiteSpace(notification.Message))
+ if (!args.IsFinal ||
+ !string.Equals(args.Role, "assistant", StringComparison.OrdinalIgnoreCase) ||
+ string.IsNullOrWhiteSpace(args.Message))
{
return;
}
@@ -632,18 +815,43 @@ private async void OnChatNotificationReceived(object? sender, OpenClawNotificati
return;
}
+ if (!IsMatchingSessionKey(args.SessionKey, GetCurrentVoiceSessionKey()))
+ {
+ return;
+ }
+
_awaitingReply = false;
_isSpeaking = true;
- _status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
- _status.SessionKey,
- VoiceRuntimeState.PlayingResponse,
- _status.LastError);
- text = notification.Message;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.PlayingResponse,
+ _status.LastError);
+ text = PrepareReplyForSpeech(args.Message);
+ }
+
+ if (string.IsNullOrWhiteSpace(text))
+ {
+ lock (_gate)
+ {
+ _isSpeaking = false;
+ if (_status.Running)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ _status.LastError);
+ }
+ }
+
+ await StartRecognitionSessionAsync();
+ return;
}
try
{
+ RaiseConversationTurn(VoiceConversationDirection.Incoming, text, args.SessionKey);
await SpeakTextAsync(text);
}
catch (Exception ex)
@@ -850,12 +1058,14 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_awaitingReply = false;
_isSpeaking = false;
+ _pendingManualTranscript = null;
}
try { runtimeCts?.Cancel(); } catch { }
if (recognizer != null)
{
+ recognizer.HypothesisGenerated -= OnSpeechHypothesisGenerated;
recognizer.ContinuousRecognitionSession.ResultGenerated -= OnSpeechResultGenerated;
recognizer.ContinuousRecognitionSession.Completed -= OnSpeechRecognitionCompleted;
@@ -875,7 +1085,7 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
if (chatClient != null)
{
chatClient.StatusChanged -= OnChatTransportStatusChanged;
- chatClient.NotificationReceived -= OnChatNotificationReceived;
+ chatClient.ChatMessageReceived -= OnChatMessageReceived;
try { await chatClient.DisconnectAsync(); } catch { }
try { chatClient.Dispose(); } catch { }
}
@@ -889,6 +1099,70 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_status = BuildStoppedStatus(sessionKey, "Disposed");
}
}
+
+ RaiseTranscriptDraft(string.Empty, sessionKey, clear: true);
+ }
+
+ private string GetCurrentVoiceSessionKey()
+ {
+ return string.IsNullOrWhiteSpace(_status.SessionKey) ? "main" : _status.SessionKey!;
+ }
+
+ private static bool IsMatchingSessionKey(string? actualSessionKey, string? expectedSessionKey)
+ {
+ actualSessionKey = string.IsNullOrWhiteSpace(actualSessionKey) ? "main" : actualSessionKey;
+ expectedSessionKey = string.IsNullOrWhiteSpace(expectedSessionKey) ? "main" : expectedSessionKey;
+
+ if (string.Equals(actualSessionKey, expectedSessionKey, StringComparison.Ordinal))
+ {
+ return true;
+ }
+
+ return IsMainSessionKey(actualSessionKey) && IsMainSessionKey(expectedSessionKey);
+ }
+
+ private static bool IsMainSessionKey(string sessionKey)
+ {
+ return sessionKey == "main" || sessionKey.Contains(":main:", StringComparison.Ordinal);
+ }
+
+ private static string PrepareReplyForSpeech(string text)
+ {
+ var trimmed = text.Trim();
+ if (string.IsNullOrWhiteSpace(trimmed))
+ {
+ return string.Empty;
+ }
+
+ var firstNewline = trimmed.IndexOf('\n');
+ if (firstNewline <= 0)
+ {
+ return trimmed;
+ }
+
+ var firstLine = trimmed[..firstNewline].Trim();
+ if (!firstLine.StartsWith("{", StringComparison.Ordinal))
+ {
+ return trimmed;
+ }
+
+ try
+ {
+ using var doc = JsonDocument.Parse(firstLine);
+ if (doc.RootElement.ValueKind != JsonValueKind.Object ||
+ !doc.RootElement.TryGetProperty("voice", out _) &&
+ !doc.RootElement.TryGetProperty("voiceId", out _) &&
+ !doc.RootElement.TryGetProperty("voice_id", out _))
+ {
+ return trimmed;
+ }
+
+ return trimmed[(firstNewline + 1)..].TrimStart();
+ }
+ catch (JsonException)
+ {
+ return trimmed;
+ }
}
private VoiceStatusInfo BuildRunningStatus(
@@ -935,6 +1209,26 @@ private VoiceStatusInfo BuildStoppedStatus(string? sessionKey, string? reason)
};
}
+ private VoiceStatusInfo BuildPausedStatus(VoiceActivationMode mode, string? sessionKey, string? reason)
+ {
+ var settings = _settings.Voice;
+ return new VoiceStatusInfo
+ {
+ Available = true,
+ Running = false,
+ Mode = mode,
+ State = VoiceRuntimeState.Paused,
+ SessionKey = sessionKey,
+ InputDeviceId = settings.InputDeviceId,
+ OutputDeviceId = settings.OutputDeviceId,
+ WakeWordModelId = settings.WakeWord.ModelId,
+ WakeWordLoaded = false,
+ LastWakeWordUtc = _status.LastWakeWordUtc,
+ LastUtteranceUtc = _status.LastUtteranceUtc,
+ LastError = reason
+ };
+ }
+
private VoiceStatusInfo BuildErrorStatus(VoiceActivationMode mode, string? sessionKey, string? reason)
{
var status = BuildRunningStatus(mode, sessionKey, VoiceRuntimeState.Error, reason);
@@ -948,6 +1242,7 @@ private static VoiceSettings Clone(VoiceSettings source)
{
Mode = source.Mode,
Enabled = source.Enabled,
+ ShowConversationToasts = source.ShowConversationToasts,
SpeechToTextProviderId = source.SpeechToTextProviderId,
TextToSpeechProviderId = source.TextToSpeechProviderId,
InputDeviceId = source.InputDeviceId,
@@ -969,7 +1264,8 @@ private static VoiceSettings Clone(VoiceSettings source)
MinSpeechMs = source.AlwaysOn.MinSpeechMs,
EndSilenceMs = source.AlwaysOn.EndSilenceMs,
MaxUtteranceMs = source.AlwaysOn.MaxUtteranceMs,
- AutoSubmit = source.AlwaysOn.AutoSubmit
+ AutoSubmit = source.AlwaysOn.AutoSubmit,
+ ChatWindowSubmitMode = source.AlwaysOn.ChatWindowSubmitMode
}
};
}
@@ -1037,4 +1333,60 @@ private static bool IsSpeechPrivacyDeclined(Exception ex)
return ex.Message.Contains("speech privacy policy", StringComparison.OrdinalIgnoreCase) ||
ex.Message.Contains("online speech recognition", StringComparison.OrdinalIgnoreCase);
}
+
+ private void RaiseConversationTurn(VoiceConversationDirection direction, string text, string? sessionKey)
+ {
+ if (string.IsNullOrWhiteSpace(text))
+ {
+ return;
+ }
+
+ ConversationTurnAvailable?.Invoke(this, new VoiceConversationTurnEventArgs
+ {
+ Direction = direction,
+ Message = text,
+ SessionKey = string.IsNullOrWhiteSpace(sessionKey) ? "main" : sessionKey,
+ Mode = _runtimeModeOverride ?? _settings.Voice.Mode
+ });
+ }
+
+ private void RaiseTranscriptDraft(string text, string? sessionKey, bool clear)
+ {
+ TranscriptDraftUpdated?.Invoke(this, new VoiceTranscriptDraftEventArgs
+ {
+ SessionKey = string.IsNullOrWhiteSpace(sessionKey) ? "main" : sessionKey,
+ Text = clear ? string.Empty : text,
+ Clear = clear,
+ Mode = _runtimeModeOverride ?? _settings.Voice.Mode
+ });
+ }
+}
+
+public enum VoiceConversationDirection
+{
+ Outgoing,
+ Incoming
+}
+
+public sealed class VoiceConversationTurnEventArgs : EventArgs
+{
+ public VoiceConversationDirection Direction { get; set; }
+ public string SessionKey { get; set; } = "main";
+ public string Message { get; set; } = "";
+ public VoiceActivationMode Mode { get; set; } = VoiceActivationMode.Off;
+}
+
+public sealed class VoiceTranscriptDraftEventArgs : EventArgs
+{
+ public string SessionKey { get; set; } = "main";
+ public string Text { get; set; } = "";
+ public bool Clear { get; set; }
+ public VoiceActivationMode Mode { get; set; } = VoiceActivationMode.Off;
+}
+
+public enum VoiceTranscriptSubmitOutcome
+{
+ Unavailable,
+ Submitted,
+ DeferredToUser
}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index 8a6bc4b..49daa78 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -7,6 +7,7 @@
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
+using System.Text.Json;
using System.Threading.Tasks;
using WinUIEx;
using Windows.Foundation;
@@ -17,12 +18,171 @@ public sealed partial class WebChatWindow : WindowEx
{
private readonly string _gatewayUrl;
private readonly string _token;
+ private string _pendingVoiceDraft = string.Empty;
// Store event handlers for cleanup
private TypedEventHandler<CoreWebView2, CoreWebView2NavigationCompletedEventArgs>? _navigationCompletedHandler;
private TypedEventHandler<CoreWebView2, CoreWebView2NavigationStartingEventArgs>? _navigationStartingHandler;
+ private TypedEventHandler<CoreWebView2, CoreWebView2WebMessageReceivedEventArgs>? _webMessageReceivedHandler;
public bool IsClosed { get; private set; }
+ public event EventHandler<VoiceTranscriptSubmittedEventArgs>? VoiceTranscriptSubmitted;
+
+ private const string TrayVoiceIntegrationScript = """
+(() => {
+ const memoryPattern = /<relevant-memories>[\s\S]*?<\/relevant-memories>\s*/gi;
+ const sanitize = (value) => typeof value === 'string' ? value.replace(memoryPattern, '').trimStart() : value;
+ const isVisible = (el) => !!el && !(el.disabled === true) && el.getClientRects().length > 0;
+ let desiredDraft = '';
+ const findComposer = () => {
+ const candidates = Array.from(document.querySelectorAll('textarea, input[type="text"], [contenteditable="true"], [contenteditable="plaintext-only"]'));
+ return candidates.find(isVisible) || null;
+ };
+ const setElementValue = (el, value) => {
+ if ('value' in el) {
+ const proto = el.tagName === 'TEXTAREA' ? HTMLTextAreaElement.prototype : HTMLInputElement.prototype;
+ const descriptor = Object.getOwnPropertyDescriptor(proto, 'value');
+ if (descriptor && descriptor.set) {
+ descriptor.set.call(el, value);
+ } else {
+ el.value = value;
+ }
+ el.dispatchEvent(new Event('input', { bubbles: true }));
+ el.dispatchEvent(new Event('change', { bubbles: true }));
+ return;
+ }
+ if (el.isContentEditable) {
+ el.textContent = value;
+ el.dispatchEvent(new Event('input', { bubbles: true }));
+ el.dispatchEvent(new Event('change', { bubbles: true }));
+ }
+ };
+ const applyDraftIfPossible = () => {
+ const composer = findComposer();
+ if (!composer) return false;
+ setElementValue(composer, desiredDraft);
+ return true;
+ };
+ const findSendButton = () => {
+ const buttons = Array.from(document.querySelectorAll('button, [role="button"], input[type="submit"]'));
+ return buttons.find((button) => {
+ if (!isVisible(button)) return false;
+ if (button.disabled === true || button.getAttribute('aria-disabled') === 'true') return false;
+ const text = ((button.innerText || button.textContent || '') + ' ' + (button.getAttribute('aria-label') || '')).trim().toLowerCase();
+ return text === 'send' || text.startsWith('send ') || text.includes('send ↵') || text.includes('send');
+ }) || null;
+ };
+ const submitDraft = (text) => {
+ desiredDraft = sanitize(text || '');
+ pendingManual = false;
+ const composer = findComposer();
+ if (!composer) return false;
+ setElementValue(composer, desiredDraft);
+ const sendButton = findSendButton();
+ if (sendButton) {
+ sendButton.click();
+ desiredDraft = '';
+ return true;
+ }
+ composer.dispatchEvent(new KeyboardEvent('keydown', { key: 'Enter', code: 'Enter', which: 13, keyCode: 13, bubbles: true }));
+ composer.dispatchEvent(new KeyboardEvent('keyup', { key: 'Enter', code: 'Enter', which: 13, keyCode: 13, bubbles: true }));
+ desiredDraft = '';
+ return true;
+ };
+ let pendingManual = false;
+ const emitManualSubmit = () => {
+ if (!pendingManual) return;
+ const composer = findComposer();
+ if (!composer) return;
+ const current = sanitize(('value' in composer ? composer.value : composer.textContent) || '');
+ if (!current) return;
+ pendingManual = false;
+ desiredDraft = '';
+ if (window.chrome?.webview?.postMessage) {
+ window.chrome.webview.postMessage(JSON.stringify({ type: 'voice-manual-submit', text: current }));
+ }
+ };
+ const cleanTextNodes = () => {
+ if (!document.body) return;
+ const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
+ const nodes = [];
+ let current;
+ while ((current = walker.nextNode())) {
+ nodes.push(current);
+ }
+ for (const node of nodes) {
+ if (!node || !node.parentElement) continue;
+ const tag = node.parentElement.tagName;
+ if (tag === 'SCRIPT' || tag === 'STYLE' || tag === 'TEXTAREA') continue;
+ const original = node.textContent || '';
+ const cleaned = sanitize(original);
+ if (cleaned !== original) {
+ node.textContent = cleaned;
+ }
+ }
+ };
+ let cleanScheduled = false;
+ const scheduleClean = () => {
+ if (cleanScheduled) return;
+ cleanScheduled = true;
+ queueMicrotask(() => {
+ cleanScheduled = false;
+ cleanTextNodes();
+ applyDraftIfPossible();
+ });
+ };
+ const observer = new MutationObserver(() => scheduleClean());
+ const start = () => {
+ if (!document.body) return;
+ observer.observe(document.body, { childList: true, subtree: true, characterData: true });
+ document.addEventListener('click', (event) => {
+ const target = event.target instanceof Element ? event.target.closest('button, [role="button"], input[type="submit"]') : null;
+ if (!target) return;
+ const sendButton = findSendButton();
+ if (sendButton && target === sendButton) {
+ emitManualSubmit();
+ }
+ }, true);
+ document.addEventListener('keydown', (event) => {
+ if (event.key !== 'Enter' || event.shiftKey) return;
+ const composer = findComposer();
+ if (!composer) return;
+ if (event.target === composer) {
+ emitManualSubmit();
+ }
+ }, true);
+ scheduleClean();
+ };
+ if (document.readyState === 'loading') {
+ document.addEventListener('DOMContentLoaded', start, { once: true });
+ } else {
+ start();
+ }
+ window.__openClawTrayVoice = {
+ setDraft(text) {
+ desiredDraft = sanitize(text || '');
+ return applyDraftIfPossible();
+ },
+ prepareManualDraft(text) {
+ desiredDraft = sanitize(text || '');
+ pendingManual = true;
+ return applyDraftIfPossible();
+ },
+ clearDraft() {
+ desiredDraft = '';
+ pendingManual = false;
+ return applyDraftIfPossible();
+ },
+ submitDraft(text) {
+ return submitDraft(text);
+ },
+ stripInjectedMemories() {
+ scheduleClean();
+ return true;
+ }
+ };
+})();
+""";
public WebChatWindow(string gatewayUrl, string token)
{
@@ -56,6 +216,8 @@ private void OnWindowClosed(object sender, WindowEventArgs e)
WebView.CoreWebView2.NavigationCompleted -= _navigationCompletedHandler;
if (_navigationStartingHandler != null)
WebView.CoreWebView2.NavigationStarting -= _navigationStartingHandler;
+ if (_webMessageReceivedHandler != null)
+ WebView.CoreWebView2.WebMessageReceived -= _webMessageReceivedHandler;
}
}
@@ -84,6 +246,7 @@ private async Task InitializeWebViewAsync()
WebView.CoreWebView2.Settings.IsStatusBarEnabled = false;
WebView.CoreWebView2.Settings.AreDefaultContextMenusEnabled = true;
WebView.CoreWebView2.Settings.IsZoomControlEnabled = true;
+ await WebView.CoreWebView2.AddScriptToExecuteOnDocumentCreatedAsync(TrayVoiceIntegrationScript);
// Handle navigation events (store for cleanup)
_navigationCompletedHandler = (s, e) =>
@@ -91,6 +254,7 @@ private async Task InitializeWebViewAsync()
Logger.Info($"WebChatWindow: Navigation completed, success={e.IsSuccess}, status={e.WebErrorStatus}");
LoadingRing.IsActive = false;
LoadingRing.Visibility = Visibility.Collapsed;
+ _ = RefreshTrayVoiceDomStateAsync();
// Show friendly error if connection failed
if (!e.IsSuccess && (e.WebErrorStatus == CoreWebView2WebErrorStatus.ConnectionAborted ||
@@ -123,6 +287,38 @@ private async Task InitializeWebViewAsync()
};
WebView.CoreWebView2.NavigationStarting += _navigationStartingHandler;
+ _webMessageReceivedHandler = (s, e) =>
+ {
+ try
+ {
+ using var doc = JsonDocument.Parse(e.TryGetWebMessageAsString());
+ if (!doc.RootElement.TryGetProperty("type", out var typeProp) ||
+ !string.Equals(typeProp.GetString(), "voice-manual-submit", StringComparison.Ordinal))
+ {
+ return;
+ }
+
+ var text = doc.RootElement.TryGetProperty("text", out var textProp)
+ ? textProp.GetString() ?? string.Empty
+ : string.Empty;
+ if (string.IsNullOrWhiteSpace(text))
+ {
+ return;
+ }
+
+ VoiceTranscriptSubmitted?.Invoke(this, new VoiceTranscriptSubmittedEventArgs
+ {
+ Text = text,
+ SessionKey = "main"
+ });
+ }
+ catch (Exception ex)
+ {
+ Logger.Warn($"WebChatWindow: Failed to process voice web message: {ex.Message}");
+ }
+ };
+ WebView.CoreWebView2.WebMessageReceived += _webMessageReceivedHandler;
+
// Navigate to chat
NavigateToChat();
}
@@ -266,4 +462,82 @@ private void OnDevTools(object sender, RoutedEventArgs e)
{
WebView.CoreWebView2?.OpenDevToolsWindow();
}
+
+ public async Task UpdateVoiceTranscriptDraftAsync(string text, bool clear)
+ {
+ _pendingVoiceDraft = clear ? string.Empty : (text ?? string.Empty);
+ await RefreshTrayVoiceDomStateAsync();
+ }
+
+ public async Task<bool> TrySubmitVoiceTranscriptAsync(string text)
+ {
+ if (WebView.CoreWebView2 == null)
+ {
+ return false;
+ }
+
+ try
+ {
+ var textJson = JsonSerializer.Serialize(text ?? string.Empty);
+ var result = await WebView.CoreWebView2.ExecuteScriptAsync(
+ $"window.__openClawTrayVoice?.submitDraft?.({textJson}) ?? false;");
+ return string.Equals(result, "true", StringComparison.OrdinalIgnoreCase);
+ }
+ catch (Exception ex)
+ {
+ Logger.Warn($"WebChatWindow: Failed to submit voice draft through chat UI: {ex.Message}");
+ return false;
+ }
+ }
+
+ public async Task<bool> PrepareVoiceTranscriptForManualSendAsync(string text)
+ {
+ if (WebView.CoreWebView2 == null)
+ {
+ return false;
+ }
+
+ try
+ {
+ var textJson = JsonSerializer.Serialize(text ?? string.Empty);
+ var result = await WebView.CoreWebView2.ExecuteScriptAsync(
+ $"window.__openClawTrayVoice?.prepareManualDraft?.({textJson}) ?? false;");
+ return string.Equals(result, "true", StringComparison.OrdinalIgnoreCase);
+ }
+ catch (Exception ex)
+ {
+ Logger.Warn($"WebChatWindow: Failed to prepare manual voice draft: {ex.Message}");
+ return false;
+ }
+ }
+
+ private async Task RefreshTrayVoiceDomStateAsync()
+ {
+ if (WebView.CoreWebView2 == null)
+ {
+ return;
+ }
+
+ try
+ {
+ await WebView.CoreWebView2.ExecuteScriptAsync("window.__openClawTrayVoice?.stripInjectedMemories?.();");
+
+ var draftJson = JsonSerializer.Serialize(_pendingVoiceDraft ?? string.Empty);
+ var script = string.IsNullOrWhiteSpace(_pendingVoiceDraft)
+ ? "window.__openClawTrayVoice?.clearDraft?.();"
+ : $"window.__openClawTrayVoice?.setDraft?.({draftJson});";
+
+ await WebView.CoreWebView2.ExecuteScriptAsync(script);
+ }
+ catch (Exception ex)
+ {
+ Logger.Warn($"WebChatWindow: Failed to apply voice DOM state: {ex.Message}");
+ }
+ }
+}
+
+public sealed class VoiceTranscriptSubmittedEventArgs : EventArgs
+{
+ public string Text { get; set; } = "";
+ public string SessionKey { get; set; } = "main";
}
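
Reviewer note on the patch above: the `sanitize` helper inside `TrayVoiceIntegrationScript` strips `<relevant-memories>` blocks before a draft is written to the composer or reported back to the host. The following is a minimal standalone sketch of that one step, reusing the pattern and helper exactly as they appear in the script. The sample strings are invented for illustration and do not come from the gateway.

```javascript
// Pattern and helper copied from TrayVoiceIntegrationScript in the patch above.
// [\s\S]*? matches lazily across newlines, so each block is removed on its own;
// the trailing \s* also swallows the whitespace that follows a block.
const memoryPattern = /<relevant-memories>[\s\S]*?<\/relevant-memories>\s*/gi;
const sanitize = (value) =>
  typeof value === 'string' ? value.replace(memoryPattern, '').trimStart() : value;

// Hypothetical inputs, for illustration only.
const withMemories =
  '<relevant-memories>\n- user prefers metric units\n</relevant-memories>\nWhat is the weather?';
const twoBlocks =
  '<relevant-memories>a</relevant-memories> keep <RELEVANT-MEMORIES>b</RELEVANT-MEMORIES>this';

console.log(sanitize(withMemories)); // prints "What is the weather?"
console.log(sanitize(twoBlocks));    // prints "keep this"
console.log(sanitize(undefined));    // non-strings pass through unchanged
```

Because `replace` with a global regex resets `lastIndex` before and after the call, the shared `memoryPattern` object can be reused safely across calls, which is how the script uses it.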
From 1340bde76878a6377fa0b93ac03cb5ac9c1bd64e Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 00:35:20 +0000
Subject: [PATCH 05/83] Fix tray voice startup and chat window submission
---
.../Services/NodeService.cs | 42 ++++++++++++-
.../Windows/WebChatWindow.xaml.cs | 60 ++++++++++++++-----
2 files changed, 85 insertions(+), 17 deletions(-)
diff --git a/src/OpenClaw.Tray.WinUI/Services/NodeService.cs b/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
index 62ea080..c005400 100644
--- a/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
@@ -88,7 +88,26 @@ public async Task ConnectAsync(string gatewayUrl, string token)
var settings = await _voiceService.GetSettingsAsync();
if (settings.Enabled && settings.Mode != VoiceActivationMode.Off)
{
- await _voiceService.StartAsync(new VoiceStartArgs { Mode = settings.Mode });
+ var startTcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+ var enqueued = _dispatcherQueue.TryEnqueue(async () =>
+ {
+ try
+ {
+ await _voiceService.StartAsync(new VoiceStartArgs { Mode = settings.Mode });
+ startTcs.TrySetResult(true);
+ }
+ catch (Exception ex)
+ {
+ startTcs.TrySetException(ex);
+ }
+ });
+
+ if (!enqueued)
+ {
+ throw new InvalidOperationException("Dispatcher queue unavailable for voice startup.");
+ }
+
+ await startTcs.Task;
}
}
}
@@ -107,7 +126,26 @@ public async Task DisconnectAsync()
if (_voiceService != null)
{
- await _voiceService.StopAsync(new VoiceStopArgs { Reason = "Node disconnected" });
+ var stopTcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+ var enqueued = _dispatcherQueue.TryEnqueue(async () =>
+ {
+ try
+ {
+ await _voiceService.StopAsync(new VoiceStopArgs { Reason = "Node disconnected" });
+ stopTcs.TrySetResult(true);
+ }
+ catch (Exception ex)
+ {
+ stopTcs.TrySetException(ex);
+ }
+ });
+
+ if (!enqueued)
+ {
+ throw new InvalidOperationException("Dispatcher queue unavailable for voice shutdown.");
+ }
+
+ await stopTcs.Task;
}
// Close canvas window
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index 49daa78..d4900b5 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -33,11 +33,19 @@ public sealed partial class WebChatWindow : WindowEx
const memoryPattern = /<relevant-memories>[\s\S]*?<\/relevant-memories>\s*/gi;
const sanitize = (value) => typeof value === 'string' ? value.replace(memoryPattern, '').trimStart() : value;
const isVisible = (el) => !!el && !(el.disabled === true) && el.getClientRects().length > 0;
+ const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
let desiredDraft = '';
const findComposer = () => {
const candidates = Array.from(document.querySelectorAll('textarea, input[type="text"], [contenteditable="true"], [contenteditable="plaintext-only"]'));
return candidates.find(isVisible) || null;
};
+ const getComposerValue = (composer) => sanitize(('value' in composer ? composer.value : composer.textContent) || '');
+ const isSendLike = (button) => {
+ if (!button || !isVisible(button)) return false;
+ if (button.disabled === true || button.getAttribute('aria-disabled') === 'true') return false;
+ const text = ((button.innerText || button.textContent || '') + ' ' + (button.getAttribute('aria-label') || '')).trim().toLowerCase();
+ return text === 'send' || text.startsWith('send ') || text.includes('send ↵') || text.includes('send');
+ };
const setElementValue = (el, value) => {
if ('value' in el) {
const proto = el.tagName === 'TEXTAREA' ? HTMLTextAreaElement.prototype : HTMLInputElement.prototype;
@@ -47,13 +55,13 @@ public sealed partial class WebChatWindow : WindowEx
} else {
el.value = value;
}
- el.dispatchEvent(new Event('input', { bubbles: true }));
+ el.dispatchEvent(new InputEvent('input', { bubbles: true, data: value, inputType: 'insertText' }));
el.dispatchEvent(new Event('change', { bubbles: true }));
return;
}
if (el.isContentEditable) {
el.textContent = value;
- el.dispatchEvent(new Event('input', { bubbles: true }));
+ el.dispatchEvent(new InputEvent('input', { bubbles: true, data: value, inputType: 'insertText' }));
el.dispatchEvent(new Event('change', { bubbles: true }));
}
};
@@ -65,29 +73,44 @@ public sealed partial class WebChatWindow : WindowEx
};
const findSendButton = () => {
const buttons = Array.from(document.querySelectorAll('button, [role="button"], input[type="submit"]'));
- return buttons.find((button) => {
- if (!isVisible(button)) return false;
- if (button.disabled === true || button.getAttribute('aria-disabled') === 'true') return false;
- const text = ((button.innerText || button.textContent || '') + ' ' + (button.getAttribute('aria-label') || '')).trim().toLowerCase();
- return text === 'send' || text.startsWith('send ') || text.includes('send ↵') || text.includes('send');
- }) || null;
+ return buttons.find(isSendLike) || null;
};
- const submitDraft = (text) => {
+ const waitForDraftToLeaveComposer = async (expectedText) => {
+ for (let i = 0; i < 20; i++) {
+ await delay(75);
+ const composer = findComposer();
+ if (!composer) return true;
+ const current = getComposerValue(composer);
+ if (!current || current !== sanitize(expectedText || '')) {
+ return true;
+ }
+ }
+ return false;
+ };
+ const submitDraft = async (text) => {
desiredDraft = sanitize(text || '');
pendingManual = false;
const composer = findComposer();
if (!composer) return false;
setElementValue(composer, desiredDraft);
+ await delay(0);
const sendButton = findSendButton();
if (sendButton) {
sendButton.click();
- desiredDraft = '';
- return true;
+ const sent = await waitForDraftToLeaveComposer(desiredDraft);
+ if (sent) {
+ desiredDraft = '';
+ }
+ return sent;
}
composer.dispatchEvent(new KeyboardEvent('keydown', { key: 'Enter', code: 'Enter', which: 13, keyCode: 13, bubbles: true }));
+ composer.dispatchEvent(new KeyboardEvent('keypress', { key: 'Enter', code: 'Enter', which: 13, keyCode: 13, bubbles: true }));
composer.dispatchEvent(new KeyboardEvent('keyup', { key: 'Enter', code: 'Enter', which: 13, keyCode: 13, bubbles: true }));
- desiredDraft = '';
- return true;
+ const sent = await waitForDraftToLeaveComposer(desiredDraft);
+ if (sent) {
+ desiredDraft = '';
+ }
+ return sent;
};
let pendingManual = false;
const emitManualSubmit = () => {
@@ -138,8 +161,15 @@ public sealed partial class WebChatWindow : WindowEx
document.addEventListener('click', (event) => {
const target = event.target instanceof Element ? event.target.closest('button, [role="button"], input[type="submit"]') : null;
if (!target) return;
- const sendButton = findSendButton();
- if (sendButton && target === sendButton) {
+ if (isSendLike(target)) {
+ emitManualSubmit();
+ }
+ }, true);
+ document.addEventListener('submit', (event) => {
+ const form = event.target instanceof HTMLFormElement ? event.target : null;
+ const composer = findComposer();
+ if (!form || !composer) return;
+ if (form.contains(composer)) {
emitManualSubmit();
}
}, true);
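
Reviewer note on the patch above: `submitDraft` no longer reports success as soon as it clicks Send. It now polls the composer through `waitForDraftToLeaveComposer` and only returns `true` once the draft text is gone. The sketch below isolates that polling pattern without a DOM. `readValue` is a hypothetical stand-in for `findComposer()` plus `getComposerValue()`, and returns `null` where the script would find no composer. The `attempts` and `intervalMs` parameters are also assumptions added for the sketch; their defaults match the 20 iterations and 75 ms hard-coded in the patch.

```javascript
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls until the composer no longer holds the text we submitted.
// readValue: hypothetical accessor returning the composer text, or null if
// no composer is present.
const waitForDraftToLeave = async (readValue, expectedText, attempts = 20, intervalMs = 75) => {
  for (let i = 0; i < attempts; i++) {
    await delay(intervalMs);
    const current = readValue();
    if (current === null) return true;            // composer disappeared
    if (!current || current !== expectedText) {   // cleared or replaced
      return true;
    }
  }
  return false; // text never left the composer: treat the send as failed
};

// Illustration: a fake composer that the "page" clears after about 30 ms.
let fakeComposerText = 'hello gateway';
setTimeout(() => { fakeComposerText = ''; }, 30);

waitForDraftToLeave(() => fakeComposerText, 'hello gateway', 20, 10)
  .then((sent) => console.log('sent:', sent));
```

One consequence of this design is that a `false` result means only that the text was still present after the polling window (about 1.5 s with the patch's values), not that the chat UI rejected the message.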
From aed8cb84be73f209d1a6d7b6c98834829854a57a Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 00:45:46 +0000
Subject: [PATCH 06/83] Remove stale always-on autosubmit setting
---
docs/VOICE-MODE.md | 2 --
src/OpenClaw.Shared/VoiceModeSchema.cs | 1 -
src/OpenClaw.Tray.WinUI/Services/VoiceService.cs | 1 -
src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs | 1 -
tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs | 1 -
tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs | 2 --
6 files changed, 8 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 5347acd..6cebad9 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -253,7 +253,6 @@ Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](C:/dev
"MinSpeechMs": 250,
"EndSilenceMs": 900,
"MaxUtteranceMs": 15000,
- "AutoSubmit": true,
"ChatWindowSubmitMode": "AutoSend"
}
}
@@ -297,7 +296,6 @@ Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](C:/dev
| `Voice.AlwaysOn.MinSpeechMs` | int | `250` | always-on | Minimum detected speech duration before an utterance is treated as real input |
| `Voice.AlwaysOn.EndSilenceMs` | int | `900` | always-on | Silence timeout used to finalize an utterance |
| `Voice.AlwaysOn.MaxUtteranceMs` | int | `15000` | always-on | Hard cap on utterance length before forced submission/finalization |
-| `Voice.AlwaysOn.AutoSubmit` | bool | `true` | always-on | If `true`, completed utterances are submitted immediately without extra confirmation |
| `Voice.AlwaysOn.ChatWindowSubmitMode` | enum | `AutoSend` | always-on | When the tray chat window is open, either auto-send the finalized utterance or leave it in the compose box for manual send |
At runtime today, those device ids are persisted and surfaced in the UI, but the v1 `AlwaysOn` path still uses the Windows system speech stack defaults for capture and playback.
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index c160f5e..3eaa0b4 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -88,7 +88,6 @@ public sealed class VoiceAlwaysOnSettings
public int MinSpeechMs { get; set; } = 250;
public int EndSilenceMs { get; set; } = 900;
public int MaxUtteranceMs { get; set; } = 15000;
- public bool AutoSubmit { get; set; } = true;
public VoiceChatWindowSubmitMode ChatWindowSubmitMode { get; set; } = VoiceChatWindowSubmitMode.AutoSend;
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
index 5eb1367..592bdc5 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
@@ -1264,7 +1264,6 @@ private static VoiceSettings Clone(VoiceSettings source)
MinSpeechMs = source.AlwaysOn.MinSpeechMs,
EndSilenceMs = source.AlwaysOn.EndSilenceMs,
MaxUtteranceMs = source.AlwaysOn.MaxUtteranceMs,
- AutoSubmit = source.AlwaysOn.AutoSubmit,
ChatWindowSubmitMode = source.AlwaysOn.ChatWindowSubmitMode
}
};
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
index ade1a83..334f3eb 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
@@ -311,7 +311,6 @@ private async void OnSave(object sender, RoutedEventArgs e)
MinSpeechMs = _settings.Voice.AlwaysOn.MinSpeechMs,
EndSilenceMs = _settings.Voice.AlwaysOn.EndSilenceMs,
MaxUtteranceMs = _settings.Voice.AlwaysOn.MaxUtteranceMs,
- AutoSubmit = _settings.Voice.AlwaysOn.AutoSubmit,
ChatWindowSubmitMode = GetSelectedChatWindowSubmitMode()
}
};
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index 8ec7478..45b489d 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -40,7 +40,6 @@ public void VoiceSettings_Defaults_AreConcreteAndProviderAgnostic()
Assert.Equal("hey_openclaw", settings.WakeWord.ModelId);
Assert.Equal(0.65f, settings.WakeWord.TriggerThreshold);
Assert.Equal(250, settings.AlwaysOn.MinSpeechMs);
- Assert.True(settings.AlwaysOn.AutoSubmit);
Assert.Equal(VoiceChatWindowSubmitMode.AutoSend, settings.AlwaysOn.ChatWindowSubmitMode);
}
diff --git a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
index 0c5ba4f..22e8306 100644
--- a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
+++ b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
@@ -54,7 +54,6 @@ public void RoundTrip_AllFields_Preserved()
MinSpeechMs = 300,
EndSilenceMs = 1100,
MaxUtteranceMs = 18000,
- AutoSubmit = false,
ChatWindowSubmitMode = VoiceChatWindowSubmitMode.WaitForUser
}
},
@@ -98,7 +97,6 @@ public void RoundTrip_AllFields_Preserved()
Assert.Equal("hey_openclaw", restored.Voice.WakeWord.ModelId);
Assert.Equal(0.72f, restored.Voice.WakeWord.TriggerThreshold);
Assert.Equal(300, restored.Voice.AlwaysOn.MinSpeechMs);
- Assert.False(restored.Voice.AlwaysOn.AutoSubmit);
Assert.Equal(VoiceChatWindowSubmitMode.WaitForUser, restored.Voice.AlwaysOn.ChatWindowSubmitMode);
Assert.NotNull(restored.UserRules);
Assert.Single(restored.UserRules);
From 25dd06bd811001f6d4739c02eab05a5675250166 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 01:00:01 +0000
Subject: [PATCH 07/83] Add focused coordinator coverage for tray voice chat
---
src/OpenClaw.Tray.WinUI/App.xaml.cs | 5 +-
.../Services/VoiceChatContracts.cs | 48 +++++
.../Services/VoiceChatCoordinator.cs | 34 ++-
.../Services/VoiceService.cs | 2 +-
.../Windows/WebChatWindow.xaml.cs | 7 +-
.../OpenClaw.Tray.Tests.csproj | 5 +-
.../VoiceChatCoordinatorTests.cs | 195 ++++++++++++++++++
7 files changed, 269 insertions(+), 27 deletions(-)
create mode 100644 src/OpenClaw.Tray.WinUI/Services/VoiceChatContracts.cs
create mode 100644 tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
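
Reviewer note on the submit-mode delegate: the coordinator now takes a `Func<VoiceChatWindowSubmitMode>` instead of a `SettingsManager`, and calls it inside `SubmitVoiceTranscriptAsync`, so the mode is read at submit time rather than captured at construction. A minimal sketch of that behaviour follows; `SubmitMode` and `MiniCoordinator` are hypothetical stand-ins for illustration and are not types from this patch.

```csharp
using System;

// Hypothetical stand-ins for illustration only; not types from this patch.
enum SubmitMode { AutoSend, WaitForUser }

sealed class MiniCoordinator
{
    private readonly Func<SubmitMode> _getSubmitMode;

    public MiniCoordinator(Func<SubmitMode> getSubmitMode) => _getSubmitMode = getSubmitMode;

    // The delegate is invoked on every call, so a settings change made after
    // construction is observed on the next submit.
    public string Submit(string text) =>
        _getSubmitMode() == SubmitMode.WaitForUser ? $"deferred: {text}" : $"sent: {text}";
}

static class Program
{
    static void Main()
    {
        var mode = SubmitMode.AutoSend;
        var coordinator = new MiniCoordinator(() => mode);

        Console.WriteLine(coordinator.Submit("first"));
        mode = SubmitMode.WaitForUser; // simulate the user changing the setting
        Console.WriteLine(coordinator.Submit("second"));
    }
}
```

The same property is what lets the tests below pass a constant lambda for each mode without constructing a `SettingsManager`.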
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index ab6feaa..9d88d28 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -254,7 +254,10 @@ protected override async void OnLaunched(LaunchActivatedEventArgs args)
// Initialize settings
_settings = new SettingsManager();
_voiceService = new VoiceService(new AppLogger(), _settings);
- _voiceChatCoordinator = new VoiceChatCoordinator(_voiceService, _settings, _dispatcherQueue!);
+ _voiceChatCoordinator = new VoiceChatCoordinator(
+ _voiceService,
+ () => _settings.Voice.AlwaysOn.ChatWindowSubmitMode,
+ new DispatcherQueueAdapter(_dispatcherQueue!));
_voiceChatCoordinator.ConversationTurnAvailable += OnVoiceConversationTurnAvailable;
// First-run check
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceChatContracts.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceChatContracts.cs
new file mode 100644
index 0000000..81f6733
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceChatContracts.cs
@@ -0,0 +1,48 @@
+using OpenClaw.Shared;
+using System;
+using System.Threading.Tasks;
+
+namespace OpenClawTray.Services;
+
+public interface IUiDispatcher
+{
+ bool TryEnqueue(Action callback);
+}
+
+public sealed class DispatcherQueueAdapter : IUiDispatcher
+{
+ private readonly Microsoft.UI.Dispatching.DispatcherQueue _dispatcherQueue;
+
+ public DispatcherQueueAdapter(Microsoft.UI.Dispatching.DispatcherQueue dispatcherQueue)
+ {
+ _dispatcherQueue = dispatcherQueue;
+ }
+
+ public bool TryEnqueue(Action callback)
+ {
+ return _dispatcherQueue.TryEnqueue(() => callback());
+ }
+}
+
+public interface IVoiceRuntime
+{
+ event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
+ event EventHandler<VoiceTranscriptDraftEventArgs>? TranscriptDraftUpdated;
+ Func<string, string?, Task<VoiceTranscriptSubmitOutcome>>? TranscriptSubmitter { get; set; }
+ void NotifyManualTranscriptSubmitted(string text, string? sessionKey = null);
+}
+
+public interface IVoiceChatWindow
+{
+ bool IsClosed { get; }
+ event EventHandler<VoiceTranscriptSubmittedEventArgs>? VoiceTranscriptSubmitted;
+ Task UpdateVoiceTranscriptDraftAsync(string text, bool clear);
+ Task<bool> TrySubmitVoiceTranscriptAsync(string text);
+ Task<bool> PrepareVoiceTranscriptForManualSendAsync(string text);
+}
+
+public sealed class VoiceTranscriptSubmittedEventArgs : EventArgs
+{
+ public string Text { get; set; } = "";
+ public string SessionKey { get; set; } = "main";
+}
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs
index eb86180..695701e 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs
@@ -1,6 +1,4 @@
-using Microsoft.UI.Dispatching;
using OpenClaw.Shared;
-using OpenClawTray.Windows;
using System;
using System.Threading.Tasks;
@@ -8,32 +6,32 @@ namespace OpenClawTray.Services;
public sealed class VoiceChatCoordinator : IDisposable
{
- private readonly VoiceService _voiceService;
- private readonly SettingsManager _settings;
- private readonly DispatcherQueue _dispatcherQueue;
+ private readonly IVoiceRuntime _voiceService;
+ private readonly Func<VoiceChatWindowSubmitMode> _getSubmitMode;
+ private readonly IUiDispatcher _dispatcher;
private readonly object _gate = new();
- private WebChatWindow? _webChatWindow;
+ private IVoiceChatWindow? _webChatWindow;
private string _voiceTranscriptDraftText = string.Empty;
private bool _disposed;
public event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
public VoiceChatCoordinator(
- VoiceService voiceService,
- SettingsManager settings,
- DispatcherQueue dispatcherQueue)
+ IVoiceRuntime voiceService,
+ Func<VoiceChatWindowSubmitMode> getSubmitMode,
+ IUiDispatcher dispatcher)
{
_voiceService = voiceService;
- _settings = settings;
- _dispatcherQueue = dispatcherQueue;
+ _getSubmitMode = getSubmitMode;
+ _dispatcher = dispatcher;
_voiceService.ConversationTurnAvailable += OnVoiceConversationTurnAvailable;
_voiceService.TranscriptDraftUpdated += OnVoiceTranscriptDraftUpdated;
_voiceService.TranscriptSubmitter = SubmitVoiceTranscriptAsync;
}
- public void AttachWindow(WebChatWindow window)
+ public void AttachWindow(IVoiceChatWindow window)
{
ArgumentNullException.ThrowIfNull(window);
@@ -58,7 +56,7 @@ public void AttachWindow(WebChatWindow window)
clear: string.IsNullOrWhiteSpace(_voiceTranscriptDraftText));
}
- public void DetachWindow(WebChatWindow? window)
+ public void DetachWindow(IVoiceChatWindow? window)
{
lock (_gate)
{
@@ -93,7 +91,7 @@ public void Dispose()
private void OnVoiceConversationTurnAvailable(object? sender, VoiceConversationTurnEventArgs args)
{
- _dispatcherQueue.TryEnqueue(() =>
+ _dispatcher.TryEnqueue(() =>
{
ConversationTurnAvailable?.Invoke(this, args);
});
@@ -101,11 +99,11 @@ private void OnVoiceConversationTurnAvailable(object? sender, VoiceConversationT
private void OnVoiceTranscriptDraftUpdated(object? sender, VoiceTranscriptDraftEventArgs args)
{
- _dispatcherQueue.TryEnqueue(() =>
+ _dispatcher.TryEnqueue(() =>
{
_voiceTranscriptDraftText = args.Clear ? string.Empty : (args.Text ?? string.Empty);
- WebChatWindow? window;
+ IVoiceChatWindow? window;
lock (_gate)
{
window = _webChatWindow;
@@ -128,7 +126,7 @@ private void OnWebChatVoiceTranscriptSubmitted(object? sender, VoiceTranscriptSu
private async Task<VoiceTranscriptSubmitOutcome> SubmitVoiceTranscriptAsync(string text, string? sessionKey)
{
- WebChatWindow? window;
+ IVoiceChatWindow? window;
lock (_gate)
{
window = _webChatWindow;
@@ -139,7 +137,7 @@ private async Task<VoiceTranscriptSubmitOutcome> SubmitVoiceTranscriptAsync(stri
return VoiceTranscriptSubmitOutcome.Unavailable;
}
- if (_settings.Voice.AlwaysOn.ChatWindowSubmitMode == VoiceChatWindowSubmitMode.WaitForUser)
+ if (_getSubmitMode() == VoiceChatWindowSubmitMode.WaitForUser)
{
return await window.PrepareVoiceTranscriptForManualSendAsync(text)
? VoiceTranscriptSubmitOutcome.DeferredToUser
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
index 592bdc5..6a06aa9 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
@@ -17,7 +17,7 @@
namespace OpenClawTray.Services;
-public sealed class VoiceService : IDisposable
+public sealed class VoiceService : IVoiceRuntime, IDisposable
{
private const int HResultSpeechPrivacyDeclined = unchecked((int)0x80045509);
private static readonly TimeSpan TransportConnectTimeout = TimeSpan.FromSeconds(10);
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index d4900b5..e614d74 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -15,6 +15,7 @@
namespace OpenClawTray.Windows;
public sealed partial class WebChatWindow : WindowEx
+ , IVoiceChatWindow
{
private readonly string _gatewayUrl;
private readonly string _token;
@@ -565,9 +566,3 @@ private async Task RefreshTrayVoiceDomStateAsync()
}
}
}
-
-public sealed class VoiceTranscriptSubmittedEventArgs : EventArgs
-{
- public string Text { get; set; } = "";
- public string SessionKey { get; set; } = "main";
-}
diff --git a/tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj b/tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj
index f795ca7..cb7fa46 100644
--- a/tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj
+++ b/tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj
@@ -1,7 +1,9 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
- <TargetFramework>net10.0</TargetFramework>
+ <TargetFramework>net10.0-windows10.0.19041.0</TargetFramework>
+ <RuntimeIdentifier>win-x64</RuntimeIdentifier>
+ <PlatformTarget>x64</PlatformTarget>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
<IsPackable>false</IsPackable>
@@ -19,6 +21,7 @@
<ItemGroup>
<ProjectReference Include="..\..\src\OpenClaw.Shared\OpenClaw.Shared.csproj" />
+ <ProjectReference Include="..\..\src\OpenClaw.Tray.WinUI\OpenClaw.Tray.WinUI.csproj" />
</ItemGroup>
</Project>
diff --git a/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
new file mode 100644
index 0000000..c9712bf
--- /dev/null
+++ b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
@@ -0,0 +1,195 @@
+using OpenClaw.Shared;
+using OpenClawTray.Services;
+
+namespace OpenClaw.Tray.Tests;
+
+public class VoiceChatCoordinatorTests
+{
+ [Fact]
+ public async Task AttachWindow_ReplaysBufferedDraft()
+ {
+ var runtime = new FakeVoiceRuntime();
+ using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.AutoSend, new ImmediateDispatcher());
+
+ runtime.RaiseDraft("hello world", "main", clear: false);
+
+ var window = new FakeVoiceChatWindow();
+ coordinator.AttachWindow(window);
+ await Task.Yield();
+
+ Assert.Equal("hello world", window.LastDraftText);
+ Assert.False(window.LastDraftClear);
+ }
+
+ [Fact]
+ public async Task Submitter_AutoSend_UsesChatWindowSubmit()
+ {
+ var runtime = new FakeVoiceRuntime();
+ using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.AutoSend, new ImmediateDispatcher());
+ var window = new FakeVoiceChatWindow { SubmitResult = true };
+ coordinator.AttachWindow(window);
+
+ var result = await runtime.TranscriptSubmitter!("send this", "main");
+
+ Assert.Equal(VoiceTranscriptSubmitOutcome.Submitted, result);
+ Assert.Equal(1, window.TrySubmitCallCount);
+ Assert.Equal(0, window.PrepareCallCount);
+ Assert.Equal("send this", window.LastSubmittedText);
+ }
+
+ [Fact]
+ public async Task Submitter_WaitForUser_PreparesDraftInsteadOfSending()
+ {
+ var runtime = new FakeVoiceRuntime();
+ using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.WaitForUser, new ImmediateDispatcher());
+ var window = new FakeVoiceChatWindow { PrepareResult = true };
+ coordinator.AttachWindow(window);
+
+ var result = await runtime.TranscriptSubmitter!("draft only", "main");
+
+ Assert.Equal(VoiceTranscriptSubmitOutcome.DeferredToUser, result);
+ Assert.Equal(0, window.TrySubmitCallCount);
+ Assert.Equal(1, window.PrepareCallCount);
+ Assert.Equal("draft only", window.LastPreparedText);
+ }
+
+ [Fact]
+ public async Task Submitter_WithoutWindow_FallsBackAsUnavailable()
+ {
+ var runtime = new FakeVoiceRuntime();
+ using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.AutoSend, new ImmediateDispatcher());
+
+ var result = await runtime.TranscriptSubmitter!("headless", "main");
+
+ Assert.Equal(VoiceTranscriptSubmitOutcome.Unavailable, result);
+ }
+
+ [Fact]
+ public async Task ManualSubmit_NotifiesRuntime_AndClearsBufferedDraft()
+ {
+ var runtime = new FakeVoiceRuntime();
+ using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.WaitForUser, new ImmediateDispatcher());
+ var firstWindow = new FakeVoiceChatWindow();
+ coordinator.AttachWindow(firstWindow);
+
+ runtime.RaiseDraft("working draft", "main", clear: false);
+ await Task.Yield();
+
+ firstWindow.RaiseSubmitted("final text", "main");
+
+ Assert.Equal("final text", runtime.LastManualSubmitText);
+ Assert.Equal("main", runtime.LastManualSubmitSessionKey);
+
+ var secondWindow = new FakeVoiceChatWindow();
+ coordinator.AttachWindow(secondWindow);
+ await Task.Yield();
+
+ Assert.Equal(string.Empty, secondWindow.LastDraftText);
+ Assert.True(secondWindow.LastDraftClear);
+ }
+
+ [Fact]
+ public void ConversationTurn_IsForwarded()
+ {
+ var runtime = new FakeVoiceRuntime();
+ using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.AutoSend, new ImmediateDispatcher());
+ VoiceConversationTurnEventArgs? received = null;
+ coordinator.ConversationTurnAvailable += (_, args) => received = args;
+
+ runtime.RaiseConversationTurn(new VoiceConversationTurnEventArgs
+ {
+ Direction = VoiceConversationDirection.Incoming,
+ Message = "reply",
+ SessionKey = "main"
+ });
+
+ Assert.NotNull(received);
+ Assert.Equal("reply", received!.Message);
+ Assert.Equal(VoiceConversationDirection.Incoming, received.Direction);
+ }
+
+ private sealed class ImmediateDispatcher : IUiDispatcher
+ {
+ public bool TryEnqueue(Action callback)
+ {
+ callback();
+ return true;
+ }
+ }
+
+ private sealed class FakeVoiceRuntime : IVoiceRuntime
+ {
+ public event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
+ public event EventHandler<VoiceTranscriptDraftEventArgs>? TranscriptDraftUpdated;
+ public Func<string, string?, Task<VoiceTranscriptSubmitOutcome>>? TranscriptSubmitter { get; set; }
+
+ public string? LastManualSubmitText { get; private set; }
+ public string? LastManualSubmitSessionKey { get; private set; }
+
+ public void NotifyManualTranscriptSubmitted(string text, string? sessionKey = null)
+ {
+ LastManualSubmitText = text;
+ LastManualSubmitSessionKey = sessionKey;
+ }
+
+ public void RaiseDraft(string text, string? sessionKey, bool clear)
+ {
+ TranscriptDraftUpdated?.Invoke(this, new VoiceTranscriptDraftEventArgs
+ {
+ Text = text,
+ SessionKey = sessionKey ?? "main",
+ Clear = clear
+ });
+ }
+
+ public void RaiseConversationTurn(VoiceConversationTurnEventArgs args)
+ {
+ ConversationTurnAvailable?.Invoke(this, args);
+ }
+ }
+
+ private sealed class FakeVoiceChatWindow : IVoiceChatWindow
+ {
+ public bool IsClosed { get; set; }
+ public event EventHandler<VoiceTranscriptSubmittedEventArgs>? VoiceTranscriptSubmitted;
+
+ public string LastDraftText { get; private set; } = string.Empty;
+ public bool LastDraftClear { get; private set; }
+ public string? LastSubmittedText { get; private set; }
+ public string? LastPreparedText { get; private set; }
+ public int TrySubmitCallCount { get; private set; }
+ public int PrepareCallCount { get; private set; }
+ public bool SubmitResult { get; set; }
+ public bool PrepareResult { get; set; }
+
+ public Task UpdateVoiceTranscriptDraftAsync(string text, bool clear)
+ {
+ LastDraftText = text;
+ LastDraftClear = clear;
+ return Task.CompletedTask;
+ }
+
+ public Task<bool> TrySubmitVoiceTranscriptAsync(string text)
+ {
+ TrySubmitCallCount++;
+ LastSubmittedText = text;
+ return Task.FromResult(SubmitResult);
+ }
+
+ public Task<bool> PrepareVoiceTranscriptForManualSendAsync(string text)
+ {
+ PrepareCallCount++;
+ LastPreparedText = text;
+ return Task.FromResult(PrepareResult);
+ }
+
+ public void RaiseSubmitted(string text, string sessionKey)
+ {
+ VoiceTranscriptSubmitted?.Invoke(this, new VoiceTranscriptSubmittedEventArgs
+ {
+ Text = text,
+ SessionKey = sessionKey
+ });
+ }
+ }
+}
From 13364724ec2c8600d4f4e195aba47324d59856ba Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 01:40:47 +0000
Subject: [PATCH 08/83] Address voice mode review findings and harden runtime
---
docs/VOICE-MODE.md | 6 +-
.../Services/GlobalHotkeyService.cs | 35 ++-
.../Services/VoiceChatContracts.cs | 2 +-
.../Services/VoiceChatCoordinator.cs | 39 ++-
.../Services/VoiceProviderCatalogService.cs | 13 +-
.../Services/VoiceService.cs | 277 ++++++++++--------
.../Windows/WebChatWindow.xaml.cs | 2 +-
.../VoiceChatCoordinatorTests.cs | 16 +-
8 files changed, 236 insertions(+), 154 deletions(-)
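
Reviewer note on the `EnsureChatTransportAsync` change: the linked `CancellationTokenSource` with an infinite delay is replaced by a single `Task.Delay(TransportConnectTimeout, cancellationToken)` raced against the ready task, and `ThrowIfCancellationRequested()` separates cancellation from a genuine timeout. The pattern can be sketched in isolation as below; `WaitForReadyAsync`, its parameters, and the final `await ready` are assumptions of this sketch, not code from the patch.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class TransportWaitSketch
{
    // Hypothetical helper: waits for `ready`, failing with TimeoutException
    // after `timeout`, or OperationCanceledException if `ct` is cancelled first.
    public static async Task WaitForReadyAsync(Task ready, TimeSpan timeout, CancellationToken ct)
    {
        var timeoutTask = Task.Delay(timeout, ct);
        var completed = await Task.WhenAny(ready, timeoutTask);
        if (completed != ready)
        {
            // The delay task finishes on either timeout or cancellation;
            // surface cancellation as such before reporting a timeout.
            ct.ThrowIfCancellationRequested();
            throw new TimeoutException("Timed out waiting for readiness.");
        }

        await ready; // sketch-only: observe any exception from the ready task
    }

    static async Task Main()
    {
        var neverReady = new TaskCompletionSource().Task;
        try
        {
            await WaitForReadyAsync(neverReady, TimeSpan.FromMilliseconds(50), CancellationToken.None);
        }
        catch (TimeoutException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

One trade-off of this shape: when the ready task wins the race, the delay task keeps its timer alive until the timeout elapses unless the token is cancelled.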
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 6cebad9..9af15ff 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -100,7 +100,7 @@ That produced two UX problems:
The tray app keeps a tray-local interim transcript buffer for the current utterance, independent of whether the chat window is open.
-The embedded [WebChatWindow.xaml.cs](C:/dev/openclaw-windows-node/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs) owns the tray-local chat integration layer:
+The embedded [WebChatWindow.xaml.cs](../src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs) owns the tray-local chat integration layer:
- interim STT hypotheses from Windows speech recognition are injected into the tray chat compose box while the user is speaking
- if the chat window opens during an utterance, the current buffered transcript is copied into the compose box immediately
@@ -221,11 +221,11 @@ The voice subsystem is introduced as a new node capability category: `voice`.
- `VoiceStopArgs`
- `VoiceSettingsUpdateArgs`
-These contracts are defined in [VoiceModeSchema.cs](C:/dev/openclaw-windows-node/src/OpenClaw.Shared/VoiceModeSchema.cs).
+These contracts are defined in [VoiceModeSchema.cs](../src/OpenClaw.Shared/VoiceModeSchema.cs).
## Settings Schema
-Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](C:/dev/openclaw-windows-node/src/OpenClaw.Shared/SettingsData.cs).
+Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](../src/OpenClaw.Shared/SettingsData.cs).
### Effective Schema
diff --git a/src/OpenClaw.Tray.WinUI/Services/GlobalHotkeyService.cs b/src/OpenClaw.Tray.WinUI/Services/GlobalHotkeyService.cs
index 92496d5..49fe829 100644
--- a/src/OpenClaw.Tray.WinUI/Services/GlobalHotkeyService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/GlobalHotkeyService.cs
@@ -109,6 +109,7 @@ private struct POINT
private IntPtr _hwnd;
private bool _registered;
private bool _disposed;
+ private readonly object _sync = new();
private Thread? _messageThread;
private WndProcDelegate? _wndProcDelegate; // prevent GC collection
private volatile bool _running;
@@ -126,12 +127,15 @@ public GlobalHotkeyService()
public bool Register()
{
- if (_registered) return true;
-
try
{
- // Create message window on a dedicated thread with message loop
- EnsureMessageLoop();
+ lock (_sync)
+ {
+ if (_registered) return true;
+
+ // Create message window on a dedicated thread with message loop
+ EnsureMessageLoop();
+ }
if (!_windowReady.Wait(TimeSpan.FromSeconds(2)))
{
@@ -139,18 +143,21 @@ public bool Register()
return false;
}
- if (_hwnd == IntPtr.Zero)
+ lock (_sync)
{
- Logger.Warn("Failed to create hotkey message window");
- return false;
- }
+ if (_hwnd == IntPtr.Zero)
+ {
+ Logger.Warn("Failed to create hotkey message window");
+ return false;
+ }
- _opCompleted.Reset();
- if (!PostMessage(_hwnd, WM_APP_REGISTER, IntPtr.Zero, IntPtr.Zero))
- {
- Logger.Warn("Failed to post WM_APP_REGISTER message for hotkey registration");
- _registered = false;
- return false;
+ _opCompleted.Reset();
+ if (!PostMessage(_hwnd, WM_APP_REGISTER, IntPtr.Zero, IntPtr.Zero))
+ {
+ Logger.Warn("Failed to post WM_APP_REGISTER message for hotkey registration");
+ _registered = false;
+ return false;
+ }
}
if (!_opCompleted.Wait(TimeSpan.FromSeconds(2)))
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceChatContracts.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceChatContracts.cs
index 81f6733..f3e2a27 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceChatContracts.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceChatContracts.cs
@@ -44,5 +44,5 @@ public interface IVoiceChatWindow
public sealed class VoiceTranscriptSubmittedEventArgs : EventArgs
{
public string Text { get; set; } = "";
- public string SessionKey { get; set; } = "main";
+ public string? SessionKey { get; set; }
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs
index 695701e..28495b4 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs
@@ -126,26 +126,33 @@ private void OnWebChatVoiceTranscriptSubmitted(object? sender, VoiceTranscriptSu
private async Task<VoiceTranscriptSubmitOutcome> SubmitVoiceTranscriptAsync(string text, string? sessionKey)
{
- IVoiceChatWindow? window;
- lock (_gate)
+ try
{
- window = _webChatWindow;
- }
+ IVoiceChatWindow? window;
+ lock (_gate)
+ {
+ window = _webChatWindow;
+ }
- if (window == null || window.IsClosed)
- {
- return VoiceTranscriptSubmitOutcome.Unavailable;
- }
+ if (window == null || window.IsClosed)
+ {
+ return VoiceTranscriptSubmitOutcome.Unavailable;
+ }
- if (_getSubmitMode() == VoiceChatWindowSubmitMode.WaitForUser)
- {
- return await window.PrepareVoiceTranscriptForManualSendAsync(text)
- ? VoiceTranscriptSubmitOutcome.DeferredToUser
+ if (_getSubmitMode() == VoiceChatWindowSubmitMode.WaitForUser)
+ {
+ return await window.PrepareVoiceTranscriptForManualSendAsync(text)
+ ? VoiceTranscriptSubmitOutcome.DeferredToUser
+ : VoiceTranscriptSubmitOutcome.Unavailable;
+ }
+
+ return await window.TrySubmitVoiceTranscriptAsync(text)
+ ? VoiceTranscriptSubmitOutcome.Submitted
: VoiceTranscriptSubmitOutcome.Unavailable;
}
-
- return await window.TrySubmitVoiceTranscriptAsync(text)
- ? VoiceTranscriptSubmitOutcome.Submitted
- : VoiceTranscriptSubmitOutcome.Unavailable;
+ catch
+ {
+ return VoiceTranscriptSubmitOutcome.Unavailable;
+ }
}
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceProviderCatalogService.cs
index 705336e..f515d75 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceProviderCatalogService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceProviderCatalogService.cs
@@ -9,6 +9,8 @@ namespace OpenClawTray.Services;
public static class VoiceProviderCatalogService
{
+ private const long MaxCatalogBytes = 256 * 1024;
+ private const int MaxProviderEntriesPerList = 64;
private static readonly string s_catalogFilePath = Path.Combine(
Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData),
"OpenClawTray",
@@ -33,6 +35,13 @@ public static VoiceProviderCatalog LoadCatalog(IOpenClawLogger? logger = null)
return merged;
}
+ var fileInfo = new FileInfo(s_catalogFilePath);
+ if (fileInfo.Length > MaxCatalogBytes)
+ {
+ logger?.Warn($"Voice provider catalog exceeds {MaxCatalogBytes} bytes and will be ignored.");
+ return merged;
+ }
+
var json = File.ReadAllText(s_catalogFilePath);
var configured = JsonSerializer.Deserialize<VoiceProviderCatalog>(json, s_jsonOptions);
if (configured == null)
@@ -107,7 +116,9 @@ private static List<VoiceProviderOption> MergeProviders(
.Select(Clone)
.ToDictionary(p => p.Id, StringComparer.OrdinalIgnoreCase);
- foreach (var provider in configured.Where(p => !string.IsNullOrWhiteSpace(p.Id)))
+ foreach (var provider in configured
+ .Where(p => !string.IsNullOrWhiteSpace(p.Id))
+ .Take(MaxProviderEntriesPerList))
{
merged[provider.Id] = Clone(provider);
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
index 6a06aa9..6c00c9f 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
@@ -19,6 +19,7 @@ namespace OpenClawTray.Services;
public sealed class VoiceService : IVoiceRuntime, IDisposable
{
+ private const string DefaultSessionKey = "main";
private const int HResultSpeechPrivacyDeclined = unchecked((int)0x80045509);
private static readonly TimeSpan TransportConnectTimeout = TimeSpan.FromSeconds(10);
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
@@ -108,8 +109,6 @@ public Task<VoiceSettings> UpdateSettingsAsync(VoiceSettingsUpdateArgs update)
_status.SessionKey,
_status.State,
_status.LastError);
- _status.LastUtteranceUtc = _status.LastUtteranceUtc;
- _status.LastWakeWordUtc = _status.LastWakeWordUtc;
}
else
{
@@ -352,12 +351,19 @@ public void Dispose()
}
_disposed = true;
- _ = StopRuntimeResourcesAsync(updateStoppedStatus: true);
+ try
+ {
+ Task.Run(() => StopRuntimeResourcesAsync(updateStoppedStatus: true)).GetAwaiter().GetResult();
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice runtime dispose cleanup failed: {ex.Message}");
+ }
}
private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? sessionKey)
{
- var effectiveSessionKey = string.IsNullOrWhiteSpace(sessionKey) ? "main" : sessionKey;
+ var effectiveSessionKey = string.IsNullOrWhiteSpace(sessionKey) ? DefaultSessionKey : sessionKey;
var selectedSpeechToText = VoiceProviderCatalogService.ResolveSpeechToTextProvider(
settings.SpeechToTextProviderId,
_logger);
@@ -368,54 +374,91 @@ private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? ses
await EnsureMicrophoneConsentAsync();
- var runtimeCts = new CancellationTokenSource();
- var recognizer = await CreateSpeechRecognizerAsync(settings);
- var synthesizer = new SpeechSynthesizer();
- var player = new MediaPlayer();
-
- if (!string.IsNullOrWhiteSpace(settings.InputDeviceId))
- {
- _logger.Warn("Selected input device is saved, but AlwaysOn currently uses the system speech input device.");
- }
+ CancellationTokenSource? runtimeCts = null;
+ SpeechRecognizer? recognizer = null;
+ SpeechSynthesizer? synthesizer = null;
+ MediaPlayer? player = null;
- if (!string.IsNullOrWhiteSpace(settings.OutputDeviceId))
+ try
{
- _logger.Warn("Selected output device is saved, but AlwaysOn currently uses the default speech output device.");
- }
+ runtimeCts = new CancellationTokenSource();
+ recognizer = await CreateSpeechRecognizerAsync(settings);
+ synthesizer = new SpeechSynthesizer();
+ player = new MediaPlayer();
- recognizer.HypothesisGenerated += OnSpeechHypothesisGenerated;
- recognizer.ContinuousRecognitionSession.ResultGenerated += OnSpeechResultGenerated;
- recognizer.ContinuousRecognitionSession.Completed += OnSpeechRecognitionCompleted;
+ if (!string.IsNullOrWhiteSpace(settings.InputDeviceId))
+ {
+ _logger.Warn("Selected input device is saved, but AlwaysOn currently uses the system speech input device.");
+ }
- lock (_gate)
- {
- _runtimeCts = runtimeCts;
- _speechRecognizer = recognizer;
- _speechSynthesizer = synthesizer;
- _mediaPlayer = player;
- _status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
- effectiveSessionKey,
- VoiceRuntimeState.Arming,
- fallbackMessage);
- }
+ if (!string.IsNullOrWhiteSpace(settings.OutputDeviceId))
+ {
+ _logger.Warn("Selected output device is saved, but AlwaysOn currently uses the default speech output device.");
+ }
- await EnsureChatTransportAsync(runtimeCts.Token);
- await StartRecognitionSessionAsync();
+ recognizer.HypothesisGenerated += OnSpeechHypothesisGenerated;
+ recognizer.ContinuousRecognitionSession.ResultGenerated += OnSpeechResultGenerated;
+ recognizer.ContinuousRecognitionSession.Completed += OnSpeechRecognitionCompleted;
- lock (_gate)
- {
- if (_status.Running)
+ lock (_gate)
{
+ _runtimeCts = runtimeCts;
+ _speechRecognizer = recognizer;
+ _speechSynthesizer = synthesizer;
+ _mediaPlayer = player;
_status = BuildRunningStatus(
VoiceActivationMode.AlwaysOn,
effectiveSessionKey,
- VoiceRuntimeState.ListeningContinuously,
+ VoiceRuntimeState.Arming,
fallbackMessage);
}
+
+ await EnsureChatTransportAsync(runtimeCts.Token);
+ await StartRecognitionSessionAsync();
+
+ lock (_gate)
+ {
+ if (_status.Running)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ effectiveSessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ fallbackMessage);
+ }
+ }
+
+ _logger.Info("Voice runtime started in mode AlwaysOn");
}
+ catch
+ {
+ var cleanupStoredState = false;
+ lock (_gate)
+ {
+ cleanupStoredState = ReferenceEquals(_runtimeCts, runtimeCts);
+ }
- _logger.Info("Voice runtime started in mode AlwaysOn");
+ if (cleanupStoredState)
+ {
+ await StopRuntimeResourcesAsync(updateStoppedStatus: false);
+ }
+ else
+ {
+ if (recognizer != null)
+ {
+ try { recognizer.HypothesisGenerated -= OnSpeechHypothesisGenerated; } catch { }
+ try { recognizer.ContinuousRecognitionSession.ResultGenerated -= OnSpeechResultGenerated; } catch { }
+ try { recognizer.ContinuousRecognitionSession.Completed -= OnSpeechRecognitionCompleted; } catch { }
+ try { recognizer.Dispose(); } catch { }
+ }
+
+ try { player?.Dispose(); } catch { }
+ try { synthesizer?.Dispose(); } catch { }
+ try { runtimeCts?.Dispose(); } catch { }
+ }
+
+ throw;
+ }
}
private async Task<SpeechRecognizer> CreateSpeechRecognizerAsync(VoiceSettings settings)
@@ -491,12 +534,11 @@ private async Task EnsureChatTransportAsync(CancellationToken cancellationToken)
readyTask = _transportReadyTcs?.Task ?? Task.CompletedTask;
}
- using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
- timeoutCts.CancelAfter(TransportConnectTimeout);
-
- var completed = await Task.WhenAny(readyTask, Task.Delay(Timeout.InfiniteTimeSpan, timeoutCts.Token));
+ var timeoutTask = Task.Delay(TransportConnectTimeout, cancellationToken);
+ var completed = await Task.WhenAny(readyTask, timeoutTask);
if (completed != readyTask)
{
+ cancellationToken.ThrowIfCancellationRequested();
throw new TimeoutException("Timed out connecting voice chat transport.");
}
@@ -528,7 +570,6 @@ private async Task StartRecognitionSessionAsync()
_status.SessionKey,
VoiceRuntimeState.ListeningContinuously,
null);
- _status.LastUtteranceUtc = _status.LastUtteranceUtc;
}
}
}
@@ -782,7 +823,6 @@ private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken
_status.SessionKey,
VoiceRuntimeState.ListeningContinuously,
"Timed out waiting for an assistant reply.");
- _status.LastUtteranceUtc = _status.LastUtteranceUtc;
shouldResume = true;
}
}
@@ -799,98 +839,104 @@ private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken
private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs args)
{
- if (!args.IsFinal ||
- !string.Equals(args.Role, "assistant", StringComparison.OrdinalIgnoreCase) ||
- string.IsNullOrWhiteSpace(args.Message))
- {
- return;
- }
-
- string text;
-
- lock (_gate)
+ try
{
- if (!_awaitingReply || !_status.Running || _status.Mode != VoiceActivationMode.AlwaysOn)
+ if (!args.IsFinal ||
+ !string.Equals(args.Role, "assistant", StringComparison.OrdinalIgnoreCase) ||
+ string.IsNullOrWhiteSpace(args.Message))
{
return;
}
- if (!IsMatchingSessionKey(args.SessionKey, GetCurrentVoiceSessionKey()))
- {
- return;
- }
-
- _awaitingReply = false;
- _isSpeaking = true;
- _status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
- _status.SessionKey,
- VoiceRuntimeState.PlayingResponse,
- _status.LastError);
- text = PrepareReplyForSpeech(args.Message);
- }
+ string text;
- if (string.IsNullOrWhiteSpace(text))
- {
lock (_gate)
{
- _isSpeaking = false;
- if (_status.Running)
+ if (!_awaitingReply || !_status.Running || _status.Mode != VoiceActivationMode.AlwaysOn)
{
- _status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
- _status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
- _status.LastError);
+ return;
}
- }
- await StartRecognitionSessionAsync();
- return;
- }
+ if (!IsMatchingSessionKey(args.SessionKey, GetCurrentVoiceSessionKey()))
+ {
+ return;
+ }
- try
- {
- RaiseConversationTurn(VoiceConversationDirection.Incoming, text, args.SessionKey);
- await SpeakTextAsync(text);
- }
- catch (Exception ex)
- {
- _logger.Error("Voice reply playback failed", ex);
- lock (_gate)
- {
+ _awaitingReply = false;
+ _isSpeaking = true;
_status = BuildRunningStatus(
VoiceActivationMode.AlwaysOn,
_status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
- GetUserFacingErrorMessage(ex));
+ VoiceRuntimeState.PlayingResponse,
+ _status.LastError);
+ text = PrepareReplyForSpeech(args.Message);
}
- }
- finally
- {
- lock (_gate)
+
+ if (string.IsNullOrWhiteSpace(text))
{
- _isSpeaking = false;
- if (_status.Running)
+ lock (_gate)
{
- _status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
- _status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
- _status.LastError);
- _status.LastUtteranceUtc = _status.LastUtteranceUtc;
+ _isSpeaking = false;
+ if (_status.Running)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ _status.LastError);
+ }
}
+
+ await StartRecognitionSessionAsync();
+ return;
}
try
{
- await StartRecognitionSessionAsync();
+ RaiseConversationTurn(VoiceConversationDirection.Incoming, text, args.SessionKey);
+ await SpeakTextAsync(text);
}
catch (Exception ex)
{
- _logger.Warn($"Voice recognition resume failed: {ex.Message}");
+ _logger.Error("Voice reply playback failed", ex);
+ lock (_gate)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ GetUserFacingErrorMessage(ex));
+ }
+ }
+ finally
+ {
+ lock (_gate)
+ {
+ _isSpeaking = false;
+ if (_status.Running)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.AlwaysOn,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ _status.LastError);
+ }
+ }
+
+ try
+ {
+ await StartRecognitionSessionAsync();
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice recognition resume failed: {ex.Message}");
+ }
}
}
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice chat message handler failed: {ex.Message}");
+ }
}
private async Task SpeakTextAsync(string text)
@@ -994,7 +1040,6 @@ private void OnChatTransportStatusChanged(object? sender, ConnectionStatus statu
_status.SessionKey,
VoiceRuntimeState.ListeningContinuously,
_status.LastError);
- _status.LastUtteranceUtc = _status.LastUtteranceUtc;
}
}
else if (status == ConnectionStatus.Error)
@@ -1009,7 +1054,6 @@ private void OnChatTransportStatusChanged(object? sender, ConnectionStatus statu
_status.SessionKey,
VoiceRuntimeState.Arming,
"Voice chat transport failed.");
- _status.LastUtteranceUtc = _status.LastUtteranceUtc;
}
}
else if (status == ConnectionStatus.Disconnected)
@@ -1021,7 +1065,6 @@ private void OnChatTransportStatusChanged(object? sender, ConnectionStatus statu
_status.SessionKey,
VoiceRuntimeState.Arming,
"Voice chat transport disconnected.");
- _status.LastUtteranceUtc = _status.LastUtteranceUtc;
}
}
}
@@ -1105,13 +1148,13 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
private string GetCurrentVoiceSessionKey()
{
- return string.IsNullOrWhiteSpace(_status.SessionKey) ? "main" : _status.SessionKey!;
+ return string.IsNullOrWhiteSpace(_status.SessionKey) ? DefaultSessionKey : _status.SessionKey!;
}
private static bool IsMatchingSessionKey(string? actualSessionKey, string? expectedSessionKey)
{
- actualSessionKey = string.IsNullOrWhiteSpace(actualSessionKey) ? "main" : actualSessionKey;
- expectedSessionKey = string.IsNullOrWhiteSpace(expectedSessionKey) ? "main" : expectedSessionKey;
+ actualSessionKey = string.IsNullOrWhiteSpace(actualSessionKey) ? DefaultSessionKey : actualSessionKey;
+ expectedSessionKey = string.IsNullOrWhiteSpace(expectedSessionKey) ? DefaultSessionKey : expectedSessionKey;
if (string.Equals(actualSessionKey, expectedSessionKey, StringComparison.Ordinal))
{
@@ -1123,7 +1166,7 @@ private static bool IsMatchingSessionKey(string? actualSessionKey, string? expec
private static bool IsMainSessionKey(string sessionKey)
{
- return sessionKey == "main" || sessionKey.Contains(":main:", StringComparison.Ordinal);
+ return sessionKey == DefaultSessionKey || sessionKey.Contains(":main:", StringComparison.Ordinal);
}
private static string PrepareReplyForSpeech(string text)
@@ -1344,7 +1387,7 @@ private void RaiseConversationTurn(VoiceConversationDirection direction, string
{
Direction = direction,
Message = text,
- SessionKey = string.IsNullOrWhiteSpace(sessionKey) ? "main" : sessionKey,
+ SessionKey = string.IsNullOrWhiteSpace(sessionKey) ? DefaultSessionKey : sessionKey,
Mode = _runtimeModeOverride ?? _settings.Voice.Mode
});
}
@@ -1353,7 +1396,7 @@ private void RaiseTranscriptDraft(string text, string? sessionKey, bool clear)
{
TranscriptDraftUpdated?.Invoke(this, new VoiceTranscriptDraftEventArgs
{
- SessionKey = string.IsNullOrWhiteSpace(sessionKey) ? "main" : sessionKey,
+ SessionKey = string.IsNullOrWhiteSpace(sessionKey) ? DefaultSessionKey : sessionKey,
Text = clear ? string.Empty : text,
Clear = clear,
Mode = _runtimeModeOverride ?? _settings.Voice.Mode
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index e614d74..219d928 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -340,7 +340,7 @@ private async Task InitializeWebViewAsync()
VoiceTranscriptSubmitted?.Invoke(this, new VoiceTranscriptSubmittedEventArgs
{
Text = text,
- SessionKey = "main"
+ SessionKey = null
});
}
catch (Exception ex)
diff --git a/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
index c9712bf..e19e3c6 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
@@ -88,6 +88,20 @@ public async Task ManualSubmit_NotifiesRuntime_AndClearsBufferedDraft()
Assert.True(secondWindow.LastDraftClear);
}
+ [Fact]
+ public void ManualSubmit_AllowsRuntimeToUseCurrentSession_WhenWindowDoesNotSpecifyOne()
+ {
+ var runtime = new FakeVoiceRuntime();
+ using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.WaitForUser, new ImmediateDispatcher());
+ var window = new FakeVoiceChatWindow();
+ coordinator.AttachWindow(window);
+
+ window.RaiseSubmitted("follow up", null);
+
+ Assert.Equal("follow up", runtime.LastManualSubmitText);
+ Assert.Null(runtime.LastManualSubmitSessionKey);
+ }
+
[Fact]
public void ConversationTurn_IsForwarded()
{
@@ -183,7 +197,7 @@ public Task<bool> PrepareVoiceTranscriptForManualSendAsync(string text)
return Task.FromResult(PrepareResult);
}
- public void RaiseSubmitted(string text, string sessionKey)
+ public void RaiseSubmitted(string text, string? sessionKey)
{
VoiceTranscriptSubmitted?.Invoke(this, new VoiceTranscriptSubmittedEventArgs
{
From 0f1028a052b31e8940e169e0151b1fdd1accf974 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 02:14:22 +0000
Subject: [PATCH 09/83] Document required Minimax and ElevenLabs provider support
---
docs/VOICE-MODE.md | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 9af15ff..9a36723 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -10,13 +10,13 @@ This document defines the voice subsystem for the Windows node only. It introduc
- `WakeWord` maps to Voice Wake
- `AlwaysOn` maps to Talk Mode
- Keep STT/TTS provider selection configurable, with Windows implementations as the default built-ins
-- Keep provider-specific STT/TTS concerns separate from the Windows node by default
+- Implement `MiniMax` STT and `ElevenLabs` TTS as required non-Windows providers after the Windows baseline
- Reuse the existing node capability pattern instead of introducing a parallel control path
## Non-Goals
- True full-duplex or chunk-streaming audio transport between node and gateway
-- Provider-specific STT/TTS routing in the Windows node
+- Arbitrary provider proliferation before the required `MiniMax` / `ElevenLabs` support is in place
- Changes to unrelated project documentation
## Design Position
@@ -133,6 +133,7 @@ The built-in default for both is `windows`.
Runtime behavior in the current phase:
- `windows` is implemented for both STT and TTS
+- `minimax` and `elevenlabs` are required next-phase providers, not optional future nice-to-haves
- non-Windows providers can be selected and persisted now
- unsupported providers fall back to Windows at runtime with a status warning
@@ -159,7 +160,7 @@ Example:
"name": "MiniMax Speech To Text",
"runtime": "gateway",
"enabled": true,
- "description": "Planned future provider."
+ "description": "Required next-phase provider."
}
],
"textToSpeechProviders": [
@@ -175,7 +176,7 @@ Example:
"name": "ElevenLabs",
"runtime": "gateway",
"enabled": true,
- "description": "Planned future provider."
+ "description": "Required next-phase provider."
}
]
}
@@ -183,15 +184,18 @@ Example:
This file only defines selectable providers. It does not carry API keys.
-### OpenClaw Configuration Discovery
+### OpenClaw Configuration and Credentials
-It may be technically possible to inspect parts of the OpenClaw configuration surface to infer preferred providers. However, the documented config protocol notes that sensitive fields have no redaction layer, so automatically pulling provider credentials into the Windows tray is not a safe default.
+For now, `MiniMax` and `ElevenLabs` credentials will be stored in the main OpenClaw configuration, not in the Windows tray settings or the local provider catalog file.
-Because of that, this design keeps provider selection local for now:
+That means the current design is:
- local tray settings choose the preferred STT/TTS provider ids
+- provider API keys are read from the main OpenClaw configuration
- OpenClaw remains the conversation endpoint for `chat.send`
-- future provider adapters can decide whether they use local credentials, gateway-owned credentials, or both
+- the local provider catalog remains metadata-only and must not contain secrets
+
+This is an intentional short-term design choice so the next implementation step can add `MiniMax` support without inventing a second credential store in Windows. It can be revisited later if provider ownership is split differently.
For `WakeWord`, trigger words are gateway-owned global state. The Windows node should eventually consume the same shared trigger list and keep only a local enabled/disabled toggle plus device/runtime settings.
From 2c8a46d6d97775f1553a75c16b64171e885dff73 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 02:37:21 +0000
Subject: [PATCH 10/83] Harden tray chat voice message handling
---
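Reviewer note (not part of the commit): the origin check added below reduces both the message source and the navigated URL to scheme plus authority before comparing them. The following is a minimal standalone sketch of that comparison, assuming only `System.Uri` from the base class library. The class name `OriginCheckSketch` is hypothetical, and the method bodies paraphrase `TryGetOrigin` and `IsTrustedVoiceMessageSource` from this patch rather than reproducing the shipped code.

```csharp
#nullable enable
using System;

public static class OriginCheckSketch
{
    // Reduce a URL to "scheme://host[:port]"; return null when it is not an absolute URI.
    public static string? TryGetOrigin(string? url) =>
        Uri.TryCreate(url, UriKind.Absolute, out var uri)
            ? uri.GetLeftPart(UriPartial.Authority)
            : null;

    // A message is trusted only when both origins are known and equal (case-insensitive).
    public static bool IsTrusted(string? source, string? expectedOrigin)
    {
        var actualOrigin = TryGetOrigin(source);
        return !string.IsNullOrWhiteSpace(expectedOrigin)
            && !string.IsNullOrWhiteSpace(actualOrigin)
            && string.Equals(actualOrigin, expectedOrigin, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        const string expected = "https://chat.example.test";

        // Path and query are dropped before comparison, so this source matches.
        Console.WriteLine(IsTrusted("https://chat.example.test/path?x=1", expected));

        // A different host is rejected, as is a missing source.
        Console.WriteLine(IsTrusted("https://evil.example.test/", expected));
        Console.WriteLine(IsTrusted(null, expected));
    }
}
```

The nonce comparison in the patch is a second, independent gate: a message from the expected origin is still dropped unless it carries the per-window nonce injected into the integration script.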
.../Windows/WebChatWindow.xaml.cs | 87 ++++++++++++++++---
.../WebChatWindowSecurityTests.cs | 71 +++++++++++++++
2 files changed, 144 insertions(+), 14 deletions(-)
create mode 100644 tests/OpenClaw.Tray.Tests/WebChatWindowSecurityTests.cs
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index 219d928..5a7be87 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -17,9 +17,12 @@ namespace OpenClawTray.Windows;
public sealed partial class WebChatWindow : WindowEx
, IVoiceChatWindow
{
+ private const string VoiceManualSubmitMessageType = "voice-manual-submit";
private readonly string _gatewayUrl;
private readonly string _token;
+ private readonly string _voiceMessageNonce = Guid.NewGuid().ToString("N");
private string _pendingVoiceDraft = string.Empty;
+ private string? _trustedVoiceMessageOrigin;
// Store event handlers for cleanup
private TypedEventHandler<CoreWebView2, CoreWebView2NavigationCompletedEventArgs>? _navigationCompletedHandler;
@@ -29,8 +32,9 @@ public sealed partial class WebChatWindow : WindowEx
public bool IsClosed { get; private set; }
public event EventHandler<VoiceTranscriptSubmittedEventArgs>? VoiceTranscriptSubmitted;
- private const string TrayVoiceIntegrationScript = """
+ private const string TrayVoiceIntegrationScriptTemplate = """
(() => {
+ const submitNonce = __VOICE_NONCE__;
const memoryPattern = /<relevant-memories>[\s\S]*?<\/relevant-memories>\s*/gi;
const sanitize = (value) => typeof value === 'string' ? value.replace(memoryPattern, '').trimStart() : value;
const isVisible = (el) => !!el && !(el.disabled === true) && el.getClientRects().length > 0;
@@ -123,7 +127,7 @@ public sealed partial class WebChatWindow : WindowEx
pendingManual = false;
desiredDraft = '';
if (window.chrome?.webview?.postMessage) {
- window.chrome.webview.postMessage(JSON.stringify({ type: 'voice-manual-submit', text: current }));
+ window.chrome.webview.postMessage(JSON.stringify({ type: 'voice-manual-submit', text: current, nonce: submitNonce }));
}
};
const cleanTextNodes = () => {
@@ -277,7 +281,8 @@ private async Task InitializeWebViewAsync()
WebView.CoreWebView2.Settings.IsStatusBarEnabled = false;
WebView.CoreWebView2.Settings.AreDefaultContextMenusEnabled = true;
WebView.CoreWebView2.Settings.IsZoomControlEnabled = true;
- await WebView.CoreWebView2.AddScriptToExecuteOnDocumentCreatedAsync(TrayVoiceIntegrationScript);
+ await WebView.CoreWebView2.AddScriptToExecuteOnDocumentCreatedAsync(
+ BuildTrayVoiceIntegrationScript(_voiceMessageNonce));
// Handle navigation events (store for cleanup)
_navigationCompletedHandler = (s, e) =>
@@ -322,17 +327,12 @@ private async Task InitializeWebViewAsync()
{
try
{
- using var doc = JsonDocument.Parse(e.TryGetWebMessageAsString());
- if (!doc.RootElement.TryGetProperty("type", out var typeProp) ||
- !string.Equals(typeProp.GetString(), "voice-manual-submit", StringComparison.Ordinal))
- {
- return;
- }
-
- var text = doc.RootElement.TryGetProperty("text", out var textProp)
- ? textProp.GetString() ?? string.Empty
- : string.Empty;
- if (string.IsNullOrWhiteSpace(text))
+ if (!TryExtractTrustedVoiceManualSubmit(
+ e.TryGetWebMessageAsString(),
+ e.Source,
+ _trustedVoiceMessageOrigin,
+ _voiceMessageNonce,
+ out var text))
{
return;
}
@@ -444,6 +444,7 @@ private void NavigateToChat()
if (!string.IsNullOrEmpty(DEBUG_TEST_URL))
{
Logger.Info($"WebChatWindow: DEBUG MODE - Navigating to test URL: {DEBUG_TEST_URL}");
+ _trustedVoiceMessageOrigin = TryGetOrigin(DEBUG_TEST_URL);
WebView.CoreWebView2.Navigate(DEBUG_TEST_URL);
return;
}
@@ -457,9 +458,67 @@ private void NavigateToChat()
var safeBaseUrl = url.Split('?')[0];
Logger.Info($"WebChatWindow: Navigating to {safeBaseUrl} (token hidden)");
+ _trustedVoiceMessageOrigin = TryGetOrigin(url);
WebView.CoreWebView2.Navigate(url);
}
+ private static string BuildTrayVoiceIntegrationScript(string nonce)
+ {
+ return TrayVoiceIntegrationScriptTemplate.Replace(
+ "__VOICE_NONCE__",
+ JsonSerializer.Serialize(nonce),
+ StringComparison.Ordinal);
+ }
+
+ private static bool TryExtractTrustedVoiceManualSubmit(
+ string payload,
+ string? source,
+ string? expectedOrigin,
+ string expectedNonce,
+ out string text)
+ {
+ text = string.Empty;
+
+ if (!IsTrustedVoiceMessageSource(source, expectedOrigin))
+ {
+ return false;
+ }
+
+ using var doc = JsonDocument.Parse(payload);
+ if (!doc.RootElement.TryGetProperty("type", out var typeProp) ||
+ !string.Equals(typeProp.GetString(), VoiceManualSubmitMessageType, StringComparison.Ordinal))
+ {
+ return false;
+ }
+
+ if (!doc.RootElement.TryGetProperty("nonce", out var nonceProp) ||
+ !string.Equals(nonceProp.GetString(), expectedNonce, StringComparison.Ordinal))
+ {
+ return false;
+ }
+
+ text = doc.RootElement.TryGetProperty("text", out var textProp)
+ ? textProp.GetString() ?? string.Empty
+ : string.Empty;
+
+ return !string.IsNullOrWhiteSpace(text);
+ }
+
+ private static bool IsTrustedVoiceMessageSource(string? source, string? expectedOrigin)
+ {
+ var actualOrigin = TryGetOrigin(source);
+ return !string.IsNullOrWhiteSpace(expectedOrigin) &&
+ !string.IsNullOrWhiteSpace(actualOrigin) &&
+ string.Equals(actualOrigin, expectedOrigin, StringComparison.OrdinalIgnoreCase);
+ }
+
+ private static string? TryGetOrigin(string? url)
+ {
+ return Uri.TryCreate(url, UriKind.Absolute, out var uri)
+ ? uri.GetLeftPart(UriPartial.Authority)
+ : null;
+ }
+
private void OnHome(object sender, RoutedEventArgs e)
{
NavigateToChat();
diff --git a/tests/OpenClaw.Tray.Tests/WebChatWindowSecurityTests.cs b/tests/OpenClaw.Tray.Tests/WebChatWindowSecurityTests.cs
new file mode 100644
index 0000000..ea07286
--- /dev/null
+++ b/tests/OpenClaw.Tray.Tests/WebChatWindowSecurityTests.cs
@@ -0,0 +1,71 @@
+using System.Reflection;
+using OpenClawTray.Windows;
+
+namespace OpenClaw.Tray.Tests;
+
+public class WebChatWindowSecurityTests
+{
+ [Fact]
+ public void TrustedVoiceSubmit_AllowsExpectedOriginAndNonce()
+ {
+ var method = GetTrustedSubmitMethod();
+ var arguments = new object?[]
+ {
+ """{"type":"voice-manual-submit","text":"hello world","nonce":"expected-nonce"}""",
+ "https://chat.example.test/path?x=1",
+ "https://chat.example.test",
+ "expected-nonce",
+ null
+ };
+
+ var accepted = (bool)method.Invoke(null, arguments)!;
+
+ Assert.True(accepted);
+ Assert.Equal("hello world", arguments[4]);
+ }
+
+ [Fact]
+ public void TrustedVoiceSubmit_RejectsUnexpectedOrigin()
+ {
+ var method = GetTrustedSubmitMethod();
+ var arguments = new object?[]
+ {
+ """{"type":"voice-manual-submit","text":"hello world","nonce":"expected-nonce"}""",
+ "https://evil.example.test/",
+ "https://chat.example.test",
+ "expected-nonce",
+ null
+ };
+
+ var accepted = (bool)method.Invoke(null, arguments)!;
+
+ Assert.False(accepted);
+ Assert.Equal(string.Empty, arguments[4]);
+ }
+
+ [Fact]
+ public void TrustedVoiceSubmit_RejectsUnexpectedNonce()
+ {
+ var method = GetTrustedSubmitMethod();
+ var arguments = new object?[]
+ {
+ """{"type":"voice-manual-submit","text":"hello world","nonce":"wrong-nonce"}""",
+ "https://chat.example.test/",
+ "https://chat.example.test",
+ "expected-nonce",
+ null
+ };
+
+ var accepted = (bool)method.Invoke(null, arguments)!;
+
+ Assert.False(accepted);
+ Assert.Equal(string.Empty, arguments[4]);
+ }
+
+ private static MethodInfo GetTrustedSubmitMethod()
+ {
+ return typeof(WebChatWindow).GetMethod(
+ "TryExtractTrustedVoiceManualSubmit",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+ }
+}
From fdbf48e04045993fdbaa75e1052c8029596e75ed Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 02:37:41 +0000
Subject: [PATCH 11/83] Fix voice transport connection task reuse
---
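Reviewer note (not part of the commit): the fix below has concurrent callers of `EnsureChatTransportAsync` share one pending `TaskCompletionSource<bool>` while a connection attempt is in flight, instead of each caller replacing it. The sketch assumes a stand-in enum `SketchStatus` (hypothetical; the real code uses `ConnectionStatus` from `OpenClaw.Shared`) and a hypothetical class name. It shows two callers ending up awaiting the same source, with only the first responsible for starting the connection.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the ConnectionStatus enum in OpenClaw.Shared.
public enum SketchStatus { Disconnected, Connecting, Connected, Error }

public static class TransportReadySketch
{
    // Same decision as GetOrCreateTransportReadySource in this patch:
    // reuse the pending source only while a connection attempt is in flight.
    public static TaskCompletionSource<bool> GetOrCreate(
        SketchStatus status,
        TaskCompletionSource<bool>? existing,
        out bool shouldStartConnection)
    {
        if (status == SketchStatus.Connecting && existing != null)
        {
            shouldStartConnection = false;
            return existing;
        }

        shouldStartConnection = true;
        return new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
    }

    public static async Task Main()
    {
        // First caller: nothing is pending, so it owns the connection attempt.
        var first = GetOrCreate(SketchStatus.Disconnected, null, out var firstStarts);

        // Second caller arrives while the first attempt is still connecting.
        var second = GetOrCreate(SketchStatus.Connecting, first, out var secondStarts);

        // Simulate the status-changed handler reporting a successful connection.
        first.TrySetResult(true);
        await Task.WhenAll(first.Task, second.Task);

        Console.WriteLine($"shared source: {ReferenceEquals(first, second)}");
        Console.WriteLine($"first starts: {firstStarts}, second starts: {secondStarts}");
    }
}
```

Before this patch, the second caller would have overwritten `_transportReadyTcs`, so the first caller could be left awaiting a source that nothing completes until the connect timeout fires.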
.../Services/VoiceService.cs | 50 +++++++++++------
.../VoiceServiceTransportTests.cs | 54 +++++++++++++++++++
2 files changed, 88 insertions(+), 16 deletions(-)
create mode 100644 tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
index 6c00c9f..f21d40d 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
@@ -500,40 +500,43 @@ private async Task EnsureMicrophoneConsentAsync()
private async Task EnsureChatTransportAsync(CancellationToken cancellationToken)
{
OpenClawGatewayClient? existingClient;
- ConnectionStatus existingStatus;
+ TaskCompletionSource<bool> readySource;
+ bool shouldStartConnection;
lock (_gate)
{
existingClient = _chatClient;
- existingStatus = _chatTransportStatus;
- if (existingStatus == ConnectionStatus.Connected)
+ if (_chatTransportStatus == ConnectionStatus.Connected)
{
return;
}
- _transportReadyTcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+ readySource = GetOrCreateTransportReadySource(
+ _chatTransportStatus,
+ _transportReadyTcs,
+ out shouldStartConnection);
+ _transportReadyTcs = readySource;
- if (existingClient == null)
+ if (shouldStartConnection)
{
- _chatClient = new OpenClawGatewayClient(_settings.GatewayUrl, _settings.Token, _logger);
- _chatClient.StatusChanged += OnChatTransportStatusChanged;
- _chatClient.ChatMessageReceived += OnChatMessageReceived;
- existingClient = _chatClient;
_chatTransportStatus = ConnectionStatus.Connecting;
+
+ if (existingClient == null)
+ {
+ _chatClient = new OpenClawGatewayClient(_settings.GatewayUrl, _settings.Token, _logger);
+ _chatClient.StatusChanged += OnChatTransportStatusChanged;
+ _chatClient.ChatMessageReceived += OnChatMessageReceived;
+ existingClient = _chatClient;
+ }
}
}
- if (existingStatus == ConnectionStatus.Disconnected || existingClient != _chatClient)
+ if (shouldStartConnection)
{
await existingClient!.ConnectAsync();
}
- Task readyTask;
- lock (_gate)
- {
- readyTask = _transportReadyTcs?.Task ?? Task.CompletedTask;
- }
-
+ var readyTask = readySource.Task;
var timeoutTask = Task.Delay(TransportConnectTimeout, cancellationToken);
var completed = await Task.WhenAny(readyTask, timeoutTask);
if (completed != readyTask)
@@ -545,6 +548,21 @@ private async Task EnsureChatTransportAsync(CancellationToken cancellationToken)
await readyTask;
}
+ private static TaskCompletionSource<bool> GetOrCreateTransportReadySource(
+ ConnectionStatus transportStatus,
+ TaskCompletionSource<bool>? existingReadySource,
+ out bool shouldStartConnection)
+ {
+ if (transportStatus == ConnectionStatus.Connecting && existingReadySource != null)
+ {
+ shouldStartConnection = false;
+ return existingReadySource;
+ }
+
+ shouldStartConnection = true;
+ return new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+ }
+
private async Task StartRecognitionSessionAsync()
{
SpeechRecognizer? recognizer;
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
new file mode 100644
index 0000000..8276939
--- /dev/null
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -0,0 +1,54 @@
+using System.Reflection;
+using OpenClaw.Shared;
+using OpenClawTray.Services;
+
+namespace OpenClaw.Tray.Tests;
+
+public class VoiceServiceTransportTests
+{
+ [Fact]
+ public void GetOrCreateTransportReadySource_ReusesExistingTaskWhileConnecting()
+ {
+ var method = GetMethod();
+ var existing = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+ var arguments = new object?[] { ConnectionStatus.Connecting, existing, null };
+
+ var result = (TaskCompletionSource<bool>)method.Invoke(null, arguments)!;
+
+ Assert.Same(existing, result);
+ Assert.False((bool)arguments[2]!);
+ }
+
+ [Fact]
+ public void GetOrCreateTransportReadySource_CreatesFreshTaskWhenDisconnected()
+ {
+ var method = GetMethod();
+ var existing = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+ var arguments = new object?[] { ConnectionStatus.Disconnected, existing, null };
+
+ var result = (TaskCompletionSource<bool>)method.Invoke(null, arguments)!;
+
+ Assert.NotSame(existing, result);
+ Assert.True((bool)arguments[2]!);
+ }
+
+ [Fact]
+ public void GetOrCreateTransportReadySource_CreatesFreshTaskAfterError()
+ {
+ var method = GetMethod();
+ var existing = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+ var arguments = new object?[] { ConnectionStatus.Error, existing, null };
+
+ var result = (TaskCompletionSource<bool>)method.Invoke(null, arguments)!;
+
+ Assert.NotSame(existing, result);
+ Assert.True((bool)arguments[2]!);
+ }
+
+ private static MethodInfo GetMethod()
+ {
+ return typeof(VoiceService).GetMethod(
+ "GetOrCreateTransportReadySource",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+ }
+}
From b556c647eca40f69db872d32ce4d7dfd051715d3 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 10:18:58 +0000
Subject: [PATCH 12/83] Group voice runtime services under Services/Voice
---
src/OpenClaw.Tray.WinUI/App.xaml.cs | 1 +
src/OpenClaw.Tray.WinUI/Services/NodeService.cs | 1 +
.../Services/{ => Voice}/VoiceChatContracts.cs | 2 +-
.../Services/{ => Voice}/VoiceChatCoordinator.cs | 2 +-
.../Services/{ => Voice}/VoiceProviderCatalogService.cs | 2 +-
src/OpenClaw.Tray.WinUI/Services/{ => Voice}/VoiceService.cs | 2 +-
src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs | 1 +
src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs | 1 +
tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs | 2 +-
tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs | 2 +-
10 files changed, 10 insertions(+), 6 deletions(-)
rename src/OpenClaw.Tray.WinUI/Services/{ => Voice}/VoiceChatContracts.cs (97%)
rename src/OpenClaw.Tray.WinUI/Services/{ => Voice}/VoiceChatCoordinator.cs (99%)
rename src/OpenClaw.Tray.WinUI/Services/{ => Voice}/VoiceProviderCatalogService.cs (99%)
rename src/OpenClaw.Tray.WinUI/Services/{ => Voice}/VoiceService.cs (99%)
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index 9d88d28..9d69256 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -6,6 +6,7 @@
using OpenClawTray.Dialogs;
using OpenClawTray.Helpers;
using OpenClawTray.Services;
+using OpenClawTray.Services.Voice;
using OpenClawTray.Windows;
using System;
using System.Collections.Generic;
diff --git a/src/OpenClaw.Tray.WinUI/Services/NodeService.cs b/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
index c005400..6869eb7 100644
--- a/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
@@ -5,6 +5,7 @@
using OpenClaw.Shared;
using OpenClaw.Shared.Capabilities;
using OpenClawTray.Helpers;
+using OpenClawTray.Services.Voice;
using OpenClawTray.Windows;
using Microsoft.UI.Xaml;
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceChatContracts.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
similarity index 97%
rename from src/OpenClaw.Tray.WinUI/Services/VoiceChatContracts.cs
rename to src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
index f3e2a27..17d7920 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceChatContracts.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
@@ -2,7 +2,7 @@
using System;
using System.Threading.Tasks;
-namespace OpenClawTray.Services;
+namespace OpenClawTray.Services.Voice;
public interface IUiDispatcher
{
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
similarity index 99%
rename from src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs
rename to src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
index 28495b4..39e88ff 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceChatCoordinator.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
@@ -2,7 +2,7 @@
using System;
using System.Threading.Tasks;
-namespace OpenClawTray.Services;
+namespace OpenClawTray.Services.Voice;
public sealed class VoiceChatCoordinator : IDisposable
{
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
similarity index 99%
rename from src/OpenClaw.Tray.WinUI/Services/VoiceProviderCatalogService.cs
rename to src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
index f515d75..9f217cd 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceProviderCatalogService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
@@ -5,7 +5,7 @@
using System.Text.Json;
using OpenClaw.Shared;
-namespace OpenClawTray.Services;
+namespace OpenClawTray.Services.Voice;
public static class VoiceProviderCatalogService
{
diff --git a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
similarity index 99%
rename from src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
rename to src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index f21d40d..c8a0b0c 100644
--- a/src/OpenClaw.Tray.WinUI/Services/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -15,7 +15,7 @@
using Windows.Media.SpeechRecognition;
using Windows.Media.SpeechSynthesis;
-namespace OpenClawTray.Services;
+namespace OpenClawTray.Services.Voice;
public sealed class VoiceService : IVoiceRuntime, IDisposable
{
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
index 334f3eb..63d8d27 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
@@ -3,6 +3,7 @@
using OpenClaw.Shared;
using OpenClawTray.Helpers;
using OpenClawTray.Services;
+using OpenClawTray.Services.Voice;
using System;
using System.Collections.Generic;
using System.Diagnostics;
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index 5a7be87..1bd1acd 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -3,6 +3,7 @@
using OpenClaw.Shared;
using OpenClawTray.Helpers;
using OpenClawTray.Services;
+using OpenClawTray.Services.Voice;
using System;
using System.Diagnostics;
using System.IO;
diff --git a/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
index e19e3c6..62151a7 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
@@ -1,5 +1,5 @@
using OpenClaw.Shared;
-using OpenClawTray.Services;
+using OpenClawTray.Services.Voice;
namespace OpenClaw.Tray.Tests;
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index 8276939..8a3e43c 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -1,6 +1,6 @@
using System.Reflection;
using OpenClaw.Shared;
-using OpenClawTray.Services;
+using OpenClawTray.Services.Voice;
namespace OpenClaw.Tray.Tests;
From 7f31c12d4fbd93412df942ce5cec06fae9e607e0 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 10:54:37 +0000
Subject: [PATCH 13/83] Implement MiniMax TTS for voice mode
---
.gitignore | 1 +
docs/VOICE-MODE.md | 59 ++++---
src/OpenClaw.Shared/SettingsData.cs | 1 +
src/OpenClaw.Shared/VoiceModeSchema.cs | 8 +
.../Services/SettingsManager.cs | 5 +-
.../Voice/VoiceProviderCatalogService.cs | 24 +++
.../Services/Voice/VoiceService.cs | 162 +++++++++++++++++-
.../Windows/VoiceModeWindow.xaml.cs | 16 +-
.../VoiceModeSchemaTests.cs | 17 ++
.../SettingsRoundTripTests.cs | 11 ++
.../VoiceProviderCatalogServiceTests.cs | 25 +++
11 files changed, 300 insertions(+), 29 deletions(-)
create mode 100644 tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
diff --git a/.gitignore b/.gitignore
index 6b3d49e..9ed19c0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -62,6 +62,7 @@ BenchmarkDotNet.Artifacts/
project.lock.json
project.fragment.lock.json
artifacts/
+.env
# ASP.NET Scaffolding
ScaffoldingReadMe.txt
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 9a36723..04bb0b0 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -10,13 +10,13 @@ This document defines the voice subsystem for the Windows node only. It introduc
- `WakeWord` maps to Voice Wake
- `AlwaysOn` maps to Talk Mode
- Keep STT/TTS provider selection configurable, with Windows implementations as the default built-ins
-- Implement `MiniMax` STT and `ElevenLabs` TTS as required non-Windows providers after the Windows baseline
+- Implement `MiniMax` TTS and `ElevenLabs` TTS as required non-Windows providers after the Windows baseline
- Reuse the existing node capability pattern instead of introducing a parallel control path
## Non-Goals
- True full-duplex or chunk-streaming audio transport between node and gateway
-- Arbitrary provider proliferation before the required `MiniMax` / `ElevenLabs` support is in place
+- Arbitrary provider proliferation before the required `MiniMax` / `ElevenLabs` TTS support is in place
- Changes to unrelated project documentation
## Design Position
@@ -133,7 +133,8 @@ The built-in default for both is `windows`.
Runtime behavior in the current phase:
- `windows` is implemented for both STT and TTS
-- `minimax` and `elevenlabs` are required next-phase providers, not optional future nice-to-haves
+- `minimax` TTS is implemented with `speech-2.8-turbo` and `English_MatureBoss`
+- `elevenlabs` TTS remains required next-phase work, not an optional future nice-to-have
- non-Windows providers can be selected and persisted now
- unsupported providers fall back to Windows at runtime with a status warning
@@ -155,13 +156,6 @@ Example:
"enabled": true,
"description": "Built-in Windows dictation and speech recognition."
},
- {
- "id": "minimax",
- "name": "MiniMax Speech To Text",
- "runtime": "gateway",
- "enabled": true,
- "description": "Required next-phase provider."
- }
],
"textToSpeechProviders": [
{
@@ -171,6 +165,13 @@ Example:
"enabled": true,
"description": "Built-in Windows text-to-speech playback."
},
+ {
+ "id": "minimax",
+ "name": "MiniMax Speech 2.8 Turbo",
+ "runtime": "cloud",
+ "enabled": true,
+ "description": "speech-2.8-turbo using English_MatureBoss."
+ },
{
"id": "elevenlabs",
"name": "ElevenLabs",
@@ -184,18 +185,21 @@ Example:
This file only defines selectable providers. It does not carry API keys.
-### OpenClaw Configuration and Credentials
-
-For now, `MiniMax` and `ElevenLabs` credentials will be stored in the main OpenClaw configuration, not in the Windows tray settings or the local provider catalog file.
+### Local Credentials
That means the current design is:
- local tray settings choose the preferred STT/TTS provider ids
-- provider API keys are read from the main OpenClaw configuration
+- provider API keys are stored in `%APPDATA%\OpenClawTray\settings.json` under `VoiceProviderCredentials`
- OpenClaw remains the conversation endpoint for `chat.send`
- the local provider catalog remains metadata-only and must not contain secrets
-This is an intentional short-term design choice so the next implementation step can add `MiniMax` support without inventing a second credential store in Windows. It can be revisited later if provider ownership is split differently.
+This is an intentional short-term design choice so the Windows tray app can use cloud TTS providers without inventing a second catalog file for secrets. It can be revisited later if provider ownership is split differently.
+
+Current credential fields:
+
+- `VoiceProviderCredentials.MiniMaxApiKey`
+- `VoiceProviderCredentials.ElevenLabsApiKey`
For `WakeWord`, trigger words are gateway-owned global state. The Windows node should eventually consume the same shared trigger list and keep only a local enabled/disabled toggle plus device/runtime settings.
@@ -230,6 +234,7 @@ These contracts are defined in [VoiceModeSchema.cs](../src/OpenClaw.Shared/Voice
## Settings Schema
Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](../src/OpenClaw.Shared/SettingsData.cs).
+Provider credentials are persisted as `SettingsData.VoiceProviderCredentials` in the same local settings file.
### Effective Schema
@@ -259,6 +264,10 @@ Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](../src
"MaxUtteranceMs": 15000,
"ChatWindowSubmitMode": "AutoSend"
}
+ },
+ "VoiceProviderCredentials": {
+ "MiniMaxApiKey": "<local secret>",
+ "ElevenLabsApiKey": null
}
}
```
@@ -301,6 +310,8 @@ Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](../src
| `Voice.AlwaysOn.EndSilenceMs` | int | `900` | always-on | Silence timeout used to finalize an utterance |
| `Voice.AlwaysOn.MaxUtteranceMs` | int | `15000` | always-on | Hard cap on utterance length before forced submission/finalization |
| `Voice.AlwaysOn.ChatWindowSubmitMode` | enum | `AutoSend` | always-on | When the tray chat window is open, either auto-send the finalized utterance or leave it in the compose box for manual send |
+| `VoiceProviderCredentials.MiniMaxApiKey` | string? | `null` | minimax tts | API key used for MiniMax cloud TTS requests |
+| `VoiceProviderCredentials.ElevenLabsApiKey` | string? | `null` | elevenlabs tts | Reserved for the required ElevenLabs TTS implementation |
At runtime today, those device ids are persisted and surfaced in the UI, but the v1 `AlwaysOn` path still uses the Windows system speech stack defaults for capture and playback.
@@ -426,12 +437,18 @@ sequenceDiagram
- `VoiceCoordinator` in `OpenClaw.Tray.WinUI.Services`
- `AudioPlaybackService` in `OpenClaw.Tray.WinUI.Services`
-## Why Provider Support Is Abstracted
+## Provider Direction
+
+Provider support is now part of the Windows voice subsystem roadmap, not a hypothetical extension:
+
+- `MiniMax` TTS is implemented first in the tray app
+- `ElevenLabs` TTS remains required follow-up work
+- Windows STT remains the active speech-recognition baseline until a non-Windows STT provider is deliberately added
-Minimax and ElevenLabs are valid future targets, but binding provider choice into the Windows node now would introduce:
+The Windows node still keeps provider choice bounded:
-- duplicated provider integration work already handled by OpenClaw
-- local credential management on Windows
-- tighter coupling between node runtime and vendor APIs
+- local tray settings choose the provider ids
+- local tray settings store the provider secrets for now
+- OpenClaw still owns the conversation/session flow
-For the first implementation, the Windows node should manage local audio behavior, local speech recognition, and local playback while reusing existing OpenClaw message flows for conversation. If provider routing becomes a real requirement later, it can be added back without changing the core activation-mode model.
+This keeps the provider surface narrow while still meeting the required MiniMax/ElevenLabs support direction.
diff --git a/src/OpenClaw.Shared/SettingsData.cs b/src/OpenClaw.Shared/SettingsData.cs
index c7af724..e421648 100644
--- a/src/OpenClaw.Shared/SettingsData.cs
+++ b/src/OpenClaw.Shared/SettingsData.cs
@@ -27,6 +27,7 @@ public class SettingsData
public bool PreferStructuredCategories { get; set; } = true;
public List<UserNotificationRule>? UserRules { get; set; }
public VoiceSettings Voice { get; set; } = new();
+ public VoiceProviderCredentials VoiceProviderCredentials { get; set; } = new();
private static readonly JsonSerializerOptions s_options = new() { WriteIndented = true };
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index 3eaa0b4..d9a85fd 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -136,6 +136,14 @@ public sealed class VoiceSettingsUpdateArgs
public static class VoiceProviderIds
{
public const string Windows = "windows";
+ public const string MiniMax = "minimax";
+ public const string ElevenLabs = "elevenlabs";
+}
+
+public sealed class VoiceProviderCredentials
+{
+ public string? MiniMaxApiKey { get; set; }
+ public string? ElevenLabsApiKey { get; set; }
}
public sealed class VoiceProviderOption
diff --git a/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs b/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
index 2fc93d7..9db01d9 100644
--- a/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
@@ -43,6 +43,7 @@ public class SettingsManager
public bool PreferStructuredCategories { get; set; } = true;
public List<OpenClaw.Shared.UserNotificationRule> UserRules { get; set; } = new();
public VoiceSettings Voice { get; set; } = new();
+ public VoiceProviderCredentials VoiceProviderCredentials { get; set; } = new();
// Node mode (enables Windows as a node, not just operator)
public bool EnableNodeMode { get; set; } = false;
@@ -84,6 +85,7 @@ public void Load()
if (loaded.UserRules != null)
UserRules = loaded.UserRules;
Voice = loaded.Voice ?? new VoiceSettings();
+ VoiceProviderCredentials = loaded.VoiceProviderCredentials ?? new VoiceProviderCredentials();
}
}
}
@@ -120,7 +122,8 @@ public void Save()
NotifyChatResponses = NotifyChatResponses,
PreferStructuredCategories = PreferStructuredCategories,
UserRules = UserRules,
- Voice = Voice
+ Voice = Voice,
+ VoiceProviderCredentials = VoiceProviderCredentials
};
var json = data.ToJson();
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
index 9f217cd..c6132a3 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
@@ -81,6 +81,16 @@ public static bool SupportsWindowsRuntime(string? providerId)
return string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase);
}
+ public static bool SupportsMiniMaxTextToSpeech(string? providerId)
+ {
+ return string.Equals(providerId, VoiceProviderIds.MiniMax, StringComparison.OrdinalIgnoreCase);
+ }
+
+ public static bool SupportsTextToSpeechRuntime(string? providerId)
+ {
+ return SupportsWindowsRuntime(providerId) || SupportsMiniMaxTextToSpeech(providerId);
+ }
+
private static VoiceProviderCatalog CreateBuiltInCatalog()
{
return new VoiceProviderCatalog
@@ -103,6 +113,20 @@ private static VoiceProviderCatalog CreateBuiltInCatalog()
Name = "Windows Speech Synthesis",
Runtime = "windows",
Description = "Built-in Windows text-to-speech playback."
+ },
+ new VoiceProviderOption
+ {
+ Id = VoiceProviderIds.MiniMax,
+ Name = "MiniMax Speech 2.8 Turbo",
+ Runtime = "cloud",
+ Description = "Cloud TTS using speech-2.8-turbo with English_MatureBoss."
+ },
+ new VoiceProviderOption
+ {
+ Id = VoiceProviderIds.ElevenLabs,
+ Name = "ElevenLabs",
+ Runtime = "cloud",
+ Description = "Cloud TTS provider planned for the next phase."
}
]
};
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index c8a0b0c..1e52934 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -1,6 +1,11 @@
using System;
using System.Collections.Generic;
+using System.Globalization;
using System.Linq;
+using System.Net.Http;
+using System.Net.Http.Headers;
+using System.Runtime.InteropServices.WindowsRuntime;
+using System.Text;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
@@ -14,16 +19,21 @@
using Windows.Media.Playback;
using Windows.Media.SpeechRecognition;
using Windows.Media.SpeechSynthesis;
+using Windows.Storage.Streams;
namespace OpenClawTray.Services.Voice;
public sealed class VoiceService : IVoiceRuntime, IDisposable
{
private const string DefaultSessionKey = "main";
+ private const string MiniMaxTtsEndpoint = "https://api.minimax.io/v1/t2a_v2";
+ private const string MiniMaxTtsModel = "speech-2.8-turbo";
+ private const string MiniMaxTtsVoiceId = "English_MatureBoss";
private const int HResultSpeechPrivacyDeclined = unchecked((int)0x80045509);
private static readonly TimeSpan TransportConnectTimeout = TimeSpan.FromSeconds(10);
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromSeconds(2);
+ private static readonly HttpClient s_httpClient = CreateHttpClient();
private readonly IOpenClawLogger _logger;
private readonly SettingsManager _settings;
@@ -959,21 +969,49 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
private async Task SpeakTextAsync(string text)
{
+ VoiceSettings settings;
+ VoiceProviderCredentials credentials;
SpeechSynthesizer? synthesizer;
MediaPlayer? player;
lock (_gate)
{
+ settings = Clone(_settings.Voice);
+ credentials = Clone(_settings.VoiceProviderCredentials);
synthesizer = _speechSynthesizer;
player = _mediaPlayer;
}
- if (synthesizer == null || player == null)
+ if (player == null)
+ {
+ throw new InvalidOperationException("Speech playback is not ready.");
+ }
+
+ var provider = VoiceProviderCatalogService.ResolveTextToSpeechProvider(
+ settings.TextToSpeechProviderId,
+ _logger);
+
+ if (VoiceProviderCatalogService.SupportsMiniMaxTextToSpeech(provider.Id))
+ {
+ await SpeakWithMiniMaxAsync(text, credentials, player);
+ return;
+ }
+
+ if (synthesizer == null)
{
throw new InvalidOperationException("Speech playback is not ready.");
}
using var stream = await synthesizer.SynthesizeTextToStreamAsync(text);
+ await PlayStreamAsync(player, stream, stream.ContentType);
+ }
+
+ private static async Task PlayStreamAsync(
+ MediaPlayer player,
+ IRandomAccessStream stream,
+ string contentType)
+ {
+ stream.Seek(0);
var playbackEnded = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
TypedEventHandler<MediaPlayer, object>? endedHandler = null;
@@ -987,7 +1025,7 @@ private async Task SpeakTextAsync(string text)
try
{
- player.Source = MediaSource.CreateFromStream(stream, stream.ContentType);
+ player.Source = MediaSource.CreateFromStream(stream, contentType);
player.Play();
await playbackEnded.Task;
}
@@ -999,6 +1037,67 @@ private async Task SpeakTextAsync(string text)
}
}
+ private async Task SpeakWithMiniMaxAsync(
+ string text,
+ VoiceProviderCredentials credentials,
+ MediaPlayer player)
+ {
+ if (string.IsNullOrWhiteSpace(credentials.MiniMaxApiKey))
+ {
+ throw new InvalidOperationException(
+ "MiniMax API key is not configured. Add VoiceProviderCredentials.MiniMaxApiKey to %APPDATA%\\OpenClawTray\\settings.json.");
+ }
+
+ if (string.IsNullOrWhiteSpace(text))
+ {
+ return;
+ }
+
+ var payload = BuildMiniMaxRequestPayload(text);
+ using var request = new HttpRequestMessage(HttpMethod.Post, MiniMaxTtsEndpoint)
+ {
+ Content = new StringContent(payload, Encoding.UTF8, "application/json")
+ };
+ request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", credentials.MiniMaxApiKey);
+
+ using var response = await s_httpClient.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
+ var responseText = await response.Content.ReadAsStringAsync();
+
+ if (!response.IsSuccessStatusCode)
+ {
+ throw new InvalidOperationException($"MiniMax TTS request failed: {(int)response.StatusCode} {response.ReasonPhrase}");
+ }
+
+ using var document = JsonDocument.Parse(responseText);
+ var statusCode = document.RootElement
+ .GetProperty("base_resp")
+ .GetProperty("status_code")
+ .GetInt32();
+ if (statusCode != 0)
+ {
+ var statusMessage = document.RootElement
+ .GetProperty("base_resp")
+ .GetProperty("status_msg")
+ .GetString() ?? "unknown error";
+ throw new InvalidOperationException($"MiniMax TTS returned an error: {statusMessage}");
+ }
+
+ var audioHex = document.RootElement
+ .GetProperty("data")
+ .GetProperty("audio")
+ .GetString();
+ if (string.IsNullOrWhiteSpace(audioHex))
+ {
+ throw new InvalidOperationException("MiniMax TTS response did not contain audio data.");
+ }
+
+ var audioBytes = DecodeHex(audioHex);
+ using var stream = new InMemoryRandomAccessStream();
+ await stream.WriteAsync(audioBytes.AsBuffer());
+ await stream.FlushAsync();
+ await PlayStreamAsync(player, stream, "audio/mpeg");
+ }
+
private async void OnSpeechRecognitionCompleted(
SpeechContinuousRecognitionSession sender,
SpeechContinuousRecognitionCompletedEventArgs args)
@@ -1349,6 +1448,15 @@ private static VoiceStatusInfo Clone(VoiceStatusInfo source)
};
}
+ private static VoiceProviderCredentials Clone(VoiceProviderCredentials source)
+ {
+ return new VoiceProviderCredentials
+ {
+ MiniMaxApiKey = source.MiniMaxApiKey,
+ ElevenLabsApiKey = source.ElevenLabsApiKey
+ };
+ }
+
private static string? BuildProviderFallbackMessage(
VoiceProviderOption speechToTextProvider,
VoiceProviderOption textToSpeechProvider)
@@ -1360,7 +1468,7 @@ private static VoiceStatusInfo Clone(VoiceStatusInfo source)
fallbacks.Add($"STT '{speechToTextProvider.Name}' is not implemented yet; using Windows Speech Recognition.");
}
- if (!VoiceProviderCatalogService.SupportsWindowsRuntime(textToSpeechProvider.Id))
+ if (!VoiceProviderCatalogService.SupportsTextToSpeechRuntime(textToSpeechProvider.Id))
{
fallbacks.Add($"TTS '{textToSpeechProvider.Name}' is not implemented yet; using Windows Speech Synthesis.");
}
@@ -1368,6 +1476,54 @@ private static VoiceStatusInfo Clone(VoiceStatusInfo source)
return fallbacks.Count == 0 ? null : string.Join(" ", fallbacks);
}
+ private static HttpClient CreateHttpClient()
+ {
+ return new HttpClient
+ {
+ Timeout = TimeSpan.FromSeconds(30)
+ };
+ }
+
+ private static string BuildMiniMaxRequestPayload(string text)
+ {
+ var payload = new
+ {
+ model = MiniMaxTtsModel,
+ text,
+ stream = false,
+ language_boost = "English",
+ output_format = "hex",
+ voice_setting = new
+ {
+ voice_id = MiniMaxTtsVoiceId,
+ speed = 1,
+ vol = 1,
+ pitch = 0
+ },
+ audio_setting = new
+ {
+ sample_rate = 32000,
+ bitrate = 128000,
+ format = "mp3",
+ channel = 1
+ }
+ };
+
+ return JsonSerializer.Serialize(payload);
+ }
+
+ private static byte[] DecodeHex(string hex)
+ {
+ try
+ {
+ return Convert.FromHexString(hex);
+ }
+ catch (FormatException ex)
+ {
+ throw new InvalidOperationException("MiniMax TTS returned invalid audio data.", ex);
+ }
+ }
+
private static string GetUserFacingErrorMessage(Exception ex)
{
if (IsSpeechPrivacyDeclined(ex))
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
index 63d8d27..b935cf8 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
@@ -218,13 +218,21 @@ private void UpdateProviderInfo()
details.Add($"TTS: {tts.Name}");
}
- var fallbackNotice = (stt != null && !string.Equals(stt.Id, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase)) ||
- (tts != null && !string.Equals(tts.Id, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase))
- ? " Selected non-Windows providers are saved now but will fall back to Windows until their runtime adapters are added."
+ var sttFallback = stt != null &&
+ !VoiceProviderCatalogService.SupportsWindowsRuntime(stt.Id);
+ var ttsFallback = tts != null &&
+ !VoiceProviderCatalogService.SupportsTextToSpeechRuntime(tts.Id);
+
+ var fallbackNotice = sttFallback || ttsFallback
+ ? " Unsupported provider selections fall back to Windows until their runtime adapters are added."
+ : string.Empty;
+ var credentialNotice = tts != null &&
+ string.Equals(tts.Id, VoiceProviderIds.MiniMax, StringComparison.OrdinalIgnoreCase)
+ ? " Configure VoiceProviderCredentials.MiniMaxApiKey in %APPDATA%\\OpenClawTray\\settings.json."
: string.Empty;
ProviderInfoTextBlock.Text =
- $"{string.Join(" · ", details)}. Configure extra providers in {VoiceProviderCatalogService.CatalogFilePath}.{fallbackNotice}";
+ $"{string.Join(" · ", details)}. Configure extra providers in {VoiceProviderCatalogService.CatalogFilePath}.{credentialNotice}{fallbackNotice}";
}
private void UpdateTroubleshooting(string? error)
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index 45b489d..8b10274 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -75,4 +75,21 @@ public void VoiceProviderCatalog_Defaults_ToEmptyLists()
Assert.Empty(catalog.SpeechToTextProviders);
Assert.Empty(catalog.TextToSpeechProviders);
}
+
+ [Fact]
+ public void VoiceProviderIds_ExposeRequiredBuiltInProviders()
+ {
+ Assert.Equal("windows", VoiceProviderIds.Windows);
+ Assert.Equal("minimax", VoiceProviderIds.MiniMax);
+ Assert.Equal("elevenlabs", VoiceProviderIds.ElevenLabs);
+ }
+
+ [Fact]
+ public void VoiceProviderCredentials_Defaults_ToEmptySecrets()
+ {
+ var credentials = new VoiceProviderCredentials();
+
+ Assert.Null(credentials.MiniMaxApiKey);
+ Assert.Null(credentials.ElevenLabsApiKey);
+ }
}
diff --git a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
index 22e8306..49b5ef7 100644
--- a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
+++ b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
@@ -57,6 +57,11 @@ public void RoundTrip_AllFields_Preserved()
ChatWindowSubmitMode = VoiceChatWindowSubmitMode.WaitForUser
}
},
+ VoiceProviderCredentials = new VoiceProviderCredentials
+ {
+ MiniMaxApiKey = "minimax-key",
+ ElevenLabsApiKey = "eleven-key"
+ },
UserRules = new List<UserNotificationRule>
{
new() { Pattern = "build.*fail", IsRegex = true, Category = "urgent", Enabled = true }
@@ -98,6 +103,9 @@ public void RoundTrip_AllFields_Preserved()
Assert.Equal(0.72f, restored.Voice.WakeWord.TriggerThreshold);
Assert.Equal(300, restored.Voice.AlwaysOn.MinSpeechMs);
Assert.Equal(VoiceChatWindowSubmitMode.WaitForUser, restored.Voice.AlwaysOn.ChatWindowSubmitMode);
+ Assert.NotNull(restored.VoiceProviderCredentials);
+ Assert.Equal("minimax-key", restored.VoiceProviderCredentials.MiniMaxApiKey);
+ Assert.Equal("eleven-key", restored.VoiceProviderCredentials.ElevenLabsApiKey);
Assert.NotNull(restored.UserRules);
Assert.Single(restored.UserRules);
Assert.Equal("build.*fail", restored.UserRules[0].Pattern);
@@ -149,6 +157,9 @@ public void MissingFields_UseDefaults()
Assert.False(settings.Voice.ShowConversationToasts);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.SpeechToTextProviderId);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.TextToSpeechProviderId);
+ Assert.NotNull(settings.VoiceProviderCredentials);
+ Assert.Null(settings.VoiceProviderCredentials.MiniMaxApiKey);
+ Assert.Null(settings.VoiceProviderCredentials.ElevenLabsApiKey);
Assert.Equal(16000, settings.Voice.SampleRateHz);
Assert.Equal("NanoWakeWord", settings.Voice.WakeWord.Engine);
Assert.Null(settings.UserRules);
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
new file mode 100644
index 0000000..159c06a
--- /dev/null
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -0,0 +1,25 @@
+using OpenClaw.Shared;
+using OpenClawTray.Services.Voice;
+
+namespace OpenClaw.Tray.Tests;
+
+public class VoiceProviderCatalogServiceTests
+{
+ [Fact]
+ public void LoadCatalog_IncludesBuiltInMiniMaxAndElevenLabsTtsProviders()
+ {
+ var catalog = VoiceProviderCatalogService.LoadCatalog();
+
+ Assert.Contains(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.Windows);
+ Assert.Contains(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.MiniMax);
+ Assert.Contains(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.ElevenLabs);
+ }
+
+ [Fact]
+ public void SupportsTextToSpeechRuntime_ReturnsTrueForMiniMaxOnlyWhenImplemented()
+ {
+ Assert.True(VoiceProviderCatalogService.SupportsTextToSpeechRuntime(VoiceProviderIds.Windows));
+ Assert.True(VoiceProviderCatalogService.SupportsTextToSpeechRuntime(VoiceProviderIds.MiniMax));
+ Assert.False(VoiceProviderCatalogService.SupportsTextToSpeechRuntime(VoiceProviderIds.ElevenLabs));
+ }
+}
From c64f16851f1bc60f6861fbfd90b551ad67f59b9c Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 12:28:36 +0000
Subject: [PATCH 14/83] Add editable TTS provider settings to voice mode
---
docs/VOICE-MODE.md | 20 +-
src/OpenClaw.Shared/VoiceModeSchema.cs | 4 +
.../Services/Voice/VoiceService.cs | 21 +-
.../Windows/VoiceModeWindow.xaml | 19 ++
.../Windows/VoiceModeWindow.xaml.cs | 180 +++++++++++++++++-
.../VoiceModeSchemaTests.cs | 4 +
.../SettingsRoundTripTests.cs | 14 +-
7 files changed, 253 insertions(+), 9 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 04bb0b0..55856cb 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -199,7 +199,17 @@ This is an intentional short-term design choice so the Windows tray app can use
Current credential fields:
- `VoiceProviderCredentials.MiniMaxApiKey`
+- `VoiceProviderCredentials.MiniMaxModel`
+- `VoiceProviderCredentials.MiniMaxVoiceId`
- `VoiceProviderCredentials.ElevenLabsApiKey`
+- `VoiceProviderCredentials.ElevenLabsModel`
+- `VoiceProviderCredentials.ElevenLabsVoiceId`
+
+When the selected TTS provider in the Voice Mode window is not `windows`, the tray app shows provider-specific fields in the configuration form so the user can enter or edit:
+
+- API key
+- model
+- voice id
For `WakeWord`, trigger words are gateway-owned global state. The Windows node should eventually consume the same shared trigger list and keep only a local enabled/disabled toggle plus device/runtime settings.
@@ -267,7 +277,11 @@ Provider credentials are persisted as `SettingsData.VoiceProviderCredentials` in
},
"VoiceProviderCredentials": {
"MiniMaxApiKey": "<local secret>",
- "ElevenLabsApiKey": null
+ "MiniMaxModel": "speech-2.8-turbo",
+ "MiniMaxVoiceId": "English_MatureBoss",
+ "ElevenLabsApiKey": null,
+ "ElevenLabsModel": null,
+ "ElevenLabsVoiceId": null
}
}
```
@@ -311,7 +325,11 @@ Provider credentials are persisted as `SettingsData.VoiceProviderCredentials` in
| `Voice.AlwaysOn.MaxUtteranceMs` | int | `15000` | always-on | Hard cap on utterance length before forced submission/finalization |
| `Voice.AlwaysOn.ChatWindowSubmitMode` | enum | `AutoSend` | always-on | When the tray chat window is open, either auto-send the finalized utterance or leave it in the compose box for manual send |
| `VoiceProviderCredentials.MiniMaxApiKey` | string? | `null` | minimax tts | API key used for MiniMax cloud TTS requests |
+| `VoiceProviderCredentials.MiniMaxModel` | string | `speech-2.8-turbo` | minimax tts | MiniMax TTS model identifier editable in the Voice Mode form |
+| `VoiceProviderCredentials.MiniMaxVoiceId` | string | `English_MatureBoss` | minimax tts | MiniMax TTS voice id editable in the Voice Mode form |
| `VoiceProviderCredentials.ElevenLabsApiKey` | string? | `null` | elevenlabs tts | Reserved for the required ElevenLabs TTS implementation |
+| `VoiceProviderCredentials.ElevenLabsModel` | string? | `null` | elevenlabs tts | Reserved for future ElevenLabs model selection in the Voice Mode form |
+| `VoiceProviderCredentials.ElevenLabsVoiceId` | string? | `null` | elevenlabs tts | Reserved for future ElevenLabs voice selection in the Voice Mode form |
At runtime today, those device ids are persisted and surfaced in the UI, but the v1 `AlwaysOn` path still uses the Windows system speech stack defaults for capture and playback.
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index d9a85fd..bc1aadb 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -143,7 +143,11 @@ public static class VoiceProviderIds
public sealed class VoiceProviderCredentials
{
public string? MiniMaxApiKey { get; set; }
+ public string MiniMaxModel { get; set; } = "speech-2.8-turbo";
+ public string MiniMaxVoiceId { get; set; } = "English_MatureBoss";
public string? ElevenLabsApiKey { get; set; }
+ public string? ElevenLabsModel { get; set; }
+ public string? ElevenLabsVoiceId { get; set; }
}
public sealed class VoiceProviderOption
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 1e52934..ebbad9b 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -1053,7 +1053,14 @@ private async Task SpeakWithMiniMaxAsync(
return;
}
- var payload = BuildMiniMaxRequestPayload(text);
+ var model = string.IsNullOrWhiteSpace(credentials.MiniMaxModel)
+ ? MiniMaxTtsModel
+ : credentials.MiniMaxModel.Trim();
+ var voiceId = string.IsNullOrWhiteSpace(credentials.MiniMaxVoiceId)
+ ? MiniMaxTtsVoiceId
+ : credentials.MiniMaxVoiceId.Trim();
+
+ var payload = BuildMiniMaxRequestPayload(text, model, voiceId);
using var request = new HttpRequestMessage(HttpMethod.Post, MiniMaxTtsEndpoint)
{
Content = new StringContent(payload, Encoding.UTF8, "application/json")
@@ -1453,7 +1460,11 @@ private static VoiceProviderCredentials Clone(VoiceProviderCredentials source)
return new VoiceProviderCredentials
{
MiniMaxApiKey = source.MiniMaxApiKey,
- ElevenLabsApiKey = source.ElevenLabsApiKey
+ MiniMaxModel = source.MiniMaxModel,
+ MiniMaxVoiceId = source.MiniMaxVoiceId,
+ ElevenLabsApiKey = source.ElevenLabsApiKey,
+ ElevenLabsModel = source.ElevenLabsModel,
+ ElevenLabsVoiceId = source.ElevenLabsVoiceId
};
}
@@ -1484,18 +1495,18 @@ private static HttpClient CreateHttpClient()
};
}
- private static string BuildMiniMaxRequestPayload(string text)
+ private static string BuildMiniMaxRequestPayload(string text, string model, string voiceId)
{
var payload = new
{
- model = MiniMaxTtsModel,
+ model,
text,
stream = false,
language_boost = "English",
output_format = "hex",
voice_setting = new
{
- voice_id = MiniMaxTtsVoiceId,
+ voice_id = voiceId,
speed = 1,
vol = 1,
pitch = 0
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
index 01b8ec3..c00f62d 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
@@ -41,6 +41,25 @@
<TextBlock Text="PROVIDERS" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
<ComboBox x:Name="SpeechToTextProviderComboBox" Header="Speech to text provider" DisplayMemberPath="Name" SelectionChanged="OnProviderChanged"/>
<ComboBox x:Name="TextToSpeechProviderComboBox" Header="Text to speech provider" DisplayMemberPath="Name" SelectionChanged="OnProviderChanged"/>
+ <StackPanel x:Name="TtsProviderSettingsPanel" Spacing="8" Visibility="Collapsed">
+ <TextBlock x:Name="TtsProviderSettingsTitleTextBlock"
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="#E74C3C"
+ FontWeight="Bold"/>
+ <PasswordBox x:Name="TtsApiKeyPasswordBox"
+ Header="API key"
+ PasswordChanged="OnProviderSettingsChanged"/>
+ <TextBox x:Name="TtsModelTextBox"
+ Header="Model"
+ TextChanged="OnProviderSettingsChanged"/>
+ <TextBox x:Name="TtsVoiceIdTextBox"
+ Header="Voice ID"
+ TextChanged="OnProviderSettingsChanged"/>
+ <TextBlock Text="These values are stored in your local tray settings file and are not read from the provider catalog."
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ TextWrapping="Wrap"/>
+ </StackPanel>
<TextBlock x:Name="ProviderInfoTextBlock"
Style="{StaticResource CaptionTextBlockStyle}"
Foreground="{ThemeResource TextFillColorSecondaryBrush}"
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
index b935cf8..dc58d27 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
@@ -17,6 +17,9 @@ public sealed partial class VoiceModeWindow : WindowEx
{
private readonly SettingsManager _settings;
private readonly VoiceService _voiceService;
+ private VoiceProviderCredentials _providerCredentialsDraft = new();
+ private string _activeTtsProviderId = VoiceProviderIds.Windows;
+ private bool _updatingProviderFields;
private List<ProviderOption> _speechToTextOptions = new();
private List<ProviderOption> _textToSpeechOptions = new();
private List<DeviceOption> _inputOptions = new();
@@ -44,11 +47,13 @@ public VoiceModeWindow(SettingsManager settings, VoiceService voiceService)
private void LoadSettings()
{
+ _providerCredentialsDraft = Clone(_settings.VoiceProviderCredentials);
LoadProviders();
SelectMode(_settings.Voice.Mode);
SelectChatWindowSubmitMode(_settings.Voice.AlwaysOn.ChatWindowSubmitMode);
VoiceConversationToastsCheckBox.IsChecked = _settings.Voice.ShowConversationToasts;
UpdateModeInfo();
+ UpdateProviderSettingsEditor();
UpdateProviderInfo();
StatusTextBlock.Text = BuildStatusText();
}
@@ -227,14 +232,162 @@ private void UpdateProviderInfo()
? " Unsupported provider selections fall back to Windows until their runtime adapters are added."
: string.Empty;
var credentialNotice = tts != null &&
- string.Equals(tts.Id, VoiceProviderIds.MiniMax, StringComparison.OrdinalIgnoreCase)
- ? " Configure VoiceProviderCredentials.MiniMaxApiKey in %APPDATA%\\OpenClawTray\\settings.json."
+ !string.Equals(tts.Id, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase)
+ ? " Configure the selected provider below; values are stored in your local tray settings."
: string.Empty;
ProviderInfoTextBlock.Text =
$"{string.Join(" · ", details)}. Configure extra providers in {VoiceProviderCatalogService.CatalogFilePath}.{credentialNotice}{fallbackNotice}";
}
+ private void UpdateProviderSettingsEditor()
+ {
+ var providerId = GetSelectedTextToSpeechProviderId();
+ var showProviderSettings = !string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase);
+
+ TtsProviderSettingsPanel.Visibility = showProviderSettings ? Visibility.Visible : Visibility.Collapsed;
+ if (!showProviderSettings)
+ {
+ _activeTtsProviderId = VoiceProviderIds.Windows;
+ return;
+ }
+
+ _updatingProviderFields = true;
+ try
+ {
+ TtsProviderSettingsTitleTextBlock.Text = $"{GetSelectedTextToSpeechProviderName().ToUpperInvariant()} SETTINGS";
+ TtsApiKeyPasswordBox.Password = GetProviderApiKey(providerId) ?? string.Empty;
+ TtsModelTextBox.Text = GetProviderModel(providerId);
+ TtsVoiceIdTextBox.Text = GetProviderVoiceId(providerId);
+ _activeTtsProviderId = providerId;
+ }
+ finally
+ {
+ _updatingProviderFields = false;
+ }
+ }
+
+ private string GetSelectedTextToSpeechProviderId()
+ {
+ return (TextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows;
+ }
+
+ private string GetSelectedTextToSpeechProviderName()
+ {
+ return (TextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Provider";
+ }
+
+ private void CaptureSelectedProviderSettings()
+ {
+ if (_updatingProviderFields)
+ {
+ return;
+ }
+
+ var providerId = _activeTtsProviderId;
+ if (string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase))
+ {
+ return;
+ }
+
+ SetProviderApiKey(providerId, TtsApiKeyPasswordBox.Password);
+ SetProviderModel(providerId, TtsModelTextBox.Text);
+ SetProviderVoiceId(providerId, TtsVoiceIdTextBox.Text);
+ }
+
+ private string? GetProviderApiKey(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => _providerCredentialsDraft.MiniMaxApiKey,
+ VoiceProviderIds.ElevenLabs => _providerCredentialsDraft.ElevenLabsApiKey,
+ _ => null
+ };
+ }
+
+ private string GetProviderModel(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => _providerCredentialsDraft.MiniMaxModel,
+ VoiceProviderIds.ElevenLabs => _providerCredentialsDraft.ElevenLabsModel ?? string.Empty,
+ _ => string.Empty
+ };
+ }
+
+ private string GetProviderVoiceId(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => _providerCredentialsDraft.MiniMaxVoiceId,
+ VoiceProviderIds.ElevenLabs => _providerCredentialsDraft.ElevenLabsVoiceId ?? string.Empty,
+ _ => string.Empty
+ };
+ }
+
+ private void SetProviderApiKey(string providerId, string? value)
+ {
+ var normalized = string.IsNullOrWhiteSpace(value) ? null : value.Trim();
+
+ switch (providerId)
+ {
+ case VoiceProviderIds.MiniMax:
+ _providerCredentialsDraft.MiniMaxApiKey = normalized;
+ break;
+ case VoiceProviderIds.ElevenLabs:
+ _providerCredentialsDraft.ElevenLabsApiKey = normalized;
+ break;
+ }
+ }
+
+ private void SetProviderModel(string providerId, string? value)
+ {
+ var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultModel(providerId) : value.Trim();
+
+ switch (providerId)
+ {
+ case VoiceProviderIds.MiniMax:
+ _providerCredentialsDraft.MiniMaxModel = normalized;
+ break;
+ case VoiceProviderIds.ElevenLabs:
+ _providerCredentialsDraft.ElevenLabsModel = normalized;
+ break;
+ }
+ }
+
+ private void SetProviderVoiceId(string providerId, string? value)
+ {
+ var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultVoiceId(providerId) : value.Trim();
+
+ switch (providerId)
+ {
+ case VoiceProviderIds.MiniMax:
+ _providerCredentialsDraft.MiniMaxVoiceId = normalized;
+ break;
+ case VoiceProviderIds.ElevenLabs:
+ _providerCredentialsDraft.ElevenLabsVoiceId = normalized;
+ break;
+ }
+ }
+
+ private static string GetDefaultModel(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => "speech-2.8-turbo",
+ _ => string.Empty
+ };
+ }
+
+ private static string GetDefaultVoiceId(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => "English_MatureBoss",
+ _ => string.Empty
+ };
+ }
+
private void UpdateTroubleshooting(string? error)
{
TroubleshootingPanel.Visibility = Visibility.Collapsed;
@@ -278,10 +431,17 @@ private void OnModeChanged(object sender, SelectionChangedEventArgs e)
private void OnProviderChanged(object sender, SelectionChangedEventArgs e)
{
+ CaptureSelectedProviderSettings();
+ UpdateProviderSettingsEditor();
UpdateProviderInfo();
StatusTextBlock.Text = BuildStatusText();
}
+ private void OnProviderSettingsChanged(object sender, RoutedEventArgs e)
+ {
+ CaptureSelectedProviderSettings();
+ }
+
private void OnOpenSpeechSettings(object sender, RoutedEventArgs e)
{
OpenSettingsUri("ms-settings:privacy-speech");
@@ -294,6 +454,8 @@ private void OnOpenMicrophoneSettings(object sender, RoutedEventArgs e)
private async void OnSave(object sender, RoutedEventArgs e)
{
+ CaptureSelectedProviderSettings();
+
var updated = new VoiceSettings
{
Mode = GetSelectedMode(),
@@ -326,6 +488,7 @@ private async void OnSave(object sender, RoutedEventArgs e)
try
{
+ _settings.VoiceProviderCredentials = Clone(_providerCredentialsDraft);
await _voiceService.UpdateSettingsAsync(new VoiceSettingsUpdateArgs
{
Settings = updated,
@@ -368,6 +531,19 @@ private static void OpenSettingsUri(string uri)
}
}
+ private static VoiceProviderCredentials Clone(VoiceProviderCredentials source)
+ {
+ return new VoiceProviderCredentials
+ {
+ MiniMaxApiKey = source.MiniMaxApiKey,
+ MiniMaxModel = source.MiniMaxModel,
+ MiniMaxVoiceId = source.MiniMaxVoiceId,
+ ElevenLabsApiKey = source.ElevenLabsApiKey,
+ ElevenLabsModel = source.ElevenLabsModel,
+ ElevenLabsVoiceId = source.ElevenLabsVoiceId
+ };
+ }
+
private sealed record DeviceOption(string? DeviceId, string Name);
private sealed record ProviderOption(string Id, string Name, string Runtime, string? Description);
}
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index 8b10274..4ec1308 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -90,6 +90,10 @@ public void VoiceProviderCredentials_Defaults_ToEmptySecrets()
var credentials = new VoiceProviderCredentials();
Assert.Null(credentials.MiniMaxApiKey);
+ Assert.Equal("speech-2.8-turbo", credentials.MiniMaxModel);
+ Assert.Equal("English_MatureBoss", credentials.MiniMaxVoiceId);
Assert.Null(credentials.ElevenLabsApiKey);
+ Assert.Null(credentials.ElevenLabsModel);
+ Assert.Null(credentials.ElevenLabsVoiceId);
}
}
diff --git a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
index 49b5ef7..1cc3f0d 100644
--- a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
+++ b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
@@ -60,7 +60,11 @@ public void RoundTrip_AllFields_Preserved()
VoiceProviderCredentials = new VoiceProviderCredentials
{
MiniMaxApiKey = "minimax-key",
- ElevenLabsApiKey = "eleven-key"
+ MiniMaxModel = "speech-2.8-turbo",
+ MiniMaxVoiceId = "English_MatureBoss",
+ ElevenLabsApiKey = "eleven-key",
+ ElevenLabsModel = "eleven-v3",
+ ElevenLabsVoiceId = "voice-42"
},
UserRules = new List<UserNotificationRule>
{
@@ -105,7 +109,11 @@ public void RoundTrip_AllFields_Preserved()
Assert.Equal(VoiceChatWindowSubmitMode.WaitForUser, restored.Voice.AlwaysOn.ChatWindowSubmitMode);
Assert.NotNull(restored.VoiceProviderCredentials);
Assert.Equal("minimax-key", restored.VoiceProviderCredentials.MiniMaxApiKey);
+ Assert.Equal("speech-2.8-turbo", restored.VoiceProviderCredentials.MiniMaxModel);
+ Assert.Equal("English_MatureBoss", restored.VoiceProviderCredentials.MiniMaxVoiceId);
Assert.Equal("eleven-key", restored.VoiceProviderCredentials.ElevenLabsApiKey);
+ Assert.Equal("eleven-v3", restored.VoiceProviderCredentials.ElevenLabsModel);
+ Assert.Equal("voice-42", restored.VoiceProviderCredentials.ElevenLabsVoiceId);
Assert.NotNull(restored.UserRules);
Assert.Single(restored.UserRules);
Assert.Equal("build.*fail", restored.UserRules[0].Pattern);
@@ -159,7 +167,11 @@ public void MissingFields_UseDefaults()
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.TextToSpeechProviderId);
Assert.NotNull(settings.VoiceProviderCredentials);
Assert.Null(settings.VoiceProviderCredentials.MiniMaxApiKey);
+ Assert.Equal("speech-2.8-turbo", settings.VoiceProviderCredentials.MiniMaxModel);
+ Assert.Equal("English_MatureBoss", settings.VoiceProviderCredentials.MiniMaxVoiceId);
Assert.Null(settings.VoiceProviderCredentials.ElevenLabsApiKey);
+ Assert.Null(settings.VoiceProviderCredentials.ElevenLabsModel);
+ Assert.Null(settings.VoiceProviderCredentials.ElevenLabsVoiceId);
Assert.Equal(16000, settings.Voice.SampleRateHz);
Assert.Equal("NanoWakeWord", settings.Voice.WakeWord.Engine);
Assert.Null(settings.UserRules);
From 907a1a0d37e6d13195f777c20ec72e6ab3106793 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 12:58:53 +0000
Subject: [PATCH 15/83] Move voice settings into main settings window
---
docs/VOICE-MODE.md | 27 +-
src/OpenClaw.Tray.WinUI/App.xaml.cs | 24 +-
.../Services/Voice/VoiceDisplayHelper.cs | 50 ++
.../Windows/SettingsWindow.xaml | 53 ++
.../Windows/SettingsWindow.xaml.cs | 428 +++++++++++++++-
.../Windows/VoiceModeWindow.xaml | 119 ++---
.../Windows/VoiceModeWindow.xaml.cs | 478 ++----------------
7 files changed, 634 insertions(+), 545 deletions(-)
create mode 100644 src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs
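
Note on the label change: this patch replaces the inline switch in `GetRunningVoiceModeLabel()` with `VoiceDisplayHelper.GetRuntimeLabel(status)`. The tray label therefore changes from the raw enum name (for example `AlwaysOn`) to the visible mode name plus a state suffix. The sketch below reproduces that mapping in isolation. The enum and record declarations are trimmed stand-ins, assumed for illustration only; the real `VoiceActivationMode`, `VoiceRuntimeState`, and `VoiceStatusInfo` live in `OpenClaw.Shared` and carry more members than shown here.

```csharp
using System;

// Hypothetical stand-ins for the OpenClaw.Shared types (assumption: only the
// members used by the label logic are reproduced here).
public enum VoiceActivationMode { Off, WakeWord, AlwaysOn }
public enum VoiceRuntimeState { Stopped, Idle, ListeningContinuously, PlayingResponse, Paused }
public sealed record VoiceStatusInfo(VoiceActivationMode Mode, VoiceRuntimeState State, bool Running);

public static class LabelSketch
{
    // Internal enum name to user-facing mode name, as in the patch.
    public static string GetModeLabel(VoiceActivationMode mode) => mode switch
    {
        VoiceActivationMode.WakeWord => "Voice Wake",
        VoiceActivationMode.AlwaysOn => "Talk Mode",
        _ => "Off"
    };

    // Subset of the state labels from the patch; unknown states read "Stopped".
    public static string GetStateLabel(VoiceRuntimeState state) => state switch
    {
        VoiceRuntimeState.ListeningContinuously => "Listening",
        VoiceRuntimeState.PlayingResponse => "Speaking",
        VoiceRuntimeState.Paused => "Paused",
        VoiceRuntimeState.Idle => "Idle",
        _ => "Stopped"
    };

    // Paused wins over Running; anything not running and not paused is "Off".
    public static string GetRuntimeLabel(VoiceStatusInfo status)
    {
        if (status.State == VoiceRuntimeState.Paused)
        {
            return $"{GetModeLabel(status.Mode)} (Paused)";
        }

        if (status.Running)
        {
            return $"{GetModeLabel(status.Mode)} ({GetStateLabel(status.State)})";
        }

        return "Off";
    }

    public static void Main()
    {
        var listening = new VoiceStatusInfo(VoiceActivationMode.AlwaysOn, VoiceRuntimeState.ListeningContinuously, true);
        var paused = new VoiceStatusInfo(VoiceActivationMode.AlwaysOn, VoiceRuntimeState.Paused, false);
        var stopped = new VoiceStatusInfo(VoiceActivationMode.Off, VoiceRuntimeState.Stopped, false);

        Console.WriteLine(GetRuntimeLabel(listening)); // Talk Mode (Listening)
        Console.WriteLine(GetRuntimeLabel(paused));    // Talk Mode (Paused)
        Console.WriteLine(GetRuntimeLabel(stopped));   // Off
    }
}
```

Before this patch the paused case rendered as `AlwaysOn (Paused)` and the running case as `AlwaysOn`, so anything that matches on the old strings (tests, menu text comparisons) would need updating.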
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 55856cb..18a9f0b 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -6,9 +6,7 @@ This document defines the voice subsystem for the Windows node only. It introduc
- Add a node-local voice mode with two activation modes: `wakeword` and `alwaysOn`
- Use NanoWakeWord for wakeword detection on-device
-- Provide parity targets with the macOS app:
- - `WakeWord` maps to Voice Wake
- - `AlwaysOn` maps to Talk Mode
+- Present the user-facing mode names as `Voice Wake` and `Talk Mode`
- Keep STT/TTS provider selection configurable, with Windows implementations as the default built-ins
- Implement `MiniMax` TTS and `ElevenLabs` TTS as required non-Windows providers after the Windows baseline
- Reuse the existing node capability pattern instead of introducing a parallel control path
@@ -33,21 +31,21 @@ OpenClaw remains responsible for conversation/session routing and upstream voice
This keeps the Windows node lean for the first implementation and avoids introducing provider-routing settings before they are needed.
-## macOS Parity Mapping
+## Visible Mode Names
-Windows voice mode aims for functional parity with the existing macOS voice surfaces:
+The tray app now uses user-facing names rather than exposing the internal enum names directly:
-| Windows Mode | macOS Equivalent | Behavior |
+| Internal Mode | Visible Name | Availability |
|---|---|---|
-| `WakeWord` | Voice Wake | passively listen for a trigger phrase, capture one utterance, then submit after end silence |
-| `AlwaysOn` | Talk Mode | continuous listen -> think -> speak loop with barge-in support, while still remaining turn-based rather than true simultaneous duplex audio |
+| `Off` | Off | available |
+| `WakeWord` | Voice Wake | visible but disabled for now |
+| `AlwaysOn` | Talk Mode | available |
-For v1 on Windows, `AlwaysOn` is Talk Mode parity, not a new full-duplex transport.
-The current implementation is still turn-based: listen, send transcript, wait, speak, resume listening.
+Internally the contracts and persisted settings still use `WakeWord` and `AlwaysOn`.
## Transport Boundary
-For macOS parity, `AlwaysOn` should follow Talk Mode's documented control flow:
+`AlwaysOn` follows a talk-mode style control flow:
- the node captures audio locally
- local speech recognition turns that audio into transcript text
@@ -56,7 +54,7 @@ For macOS parity, `AlwaysOn` should follow Talk Mode's documented control flow:
- OpenClaw returns the assistant reply as normal chat output
- the node performs local TTS playback of that reply
-That means the first Windows parity target is transcript transport, not raw audio upload. Streaming audio frames in or out of OpenClaw remains a future protocol extension, not part of this design.
+That means the first Windows target is transcript transport, not raw audio upload. Streaming audio frames in or out of OpenClaw remains a future protocol extension, not part of this design.
The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection exists to carry assistant chat events for `AlwaysOn`, and to provide a fallback direct `chat.send` path when the tray chat window is not open.
@@ -246,6 +244,9 @@ These contracts are defined in [VoiceModeSchema.cs](../src/OpenClaw.Shared/Voice
Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](../src/OpenClaw.Shared/SettingsData.cs).
Provider credentials are persisted as `SettingsData.VoiceProviderCredentials` in the same local settings file.
+The editable voice configuration now lives in the main Settings window.
+The tray `Voice Mode` window is a read-only runtime status/detail surface with a shortcut back into Settings.
+
### Effective Schema
```json
@@ -445,7 +446,7 @@ sequenceDiagram
- `WindowsNodeClient` remains the gateway/node transport
- existing node capability registration remains the integration pattern
- current request/response transport remains the v1 control plane
-- `AlwaysOn` parity should reuse existing `chat.send` message flow instead of inventing an audio-upload protocol
+- `AlwaysOn` should reuse existing `chat.send` message flow instead of inventing an audio-upload protocol
### New Components Expected Later
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index 9d69256..9cb8914 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -746,22 +746,7 @@ private string GetRunningVoiceModeLabel()
return "Off";
}
- if (status.State == VoiceRuntimeState.Paused)
- {
- return $"{status.Mode} (Paused)";
- }
-
- if (status.Running)
- {
- return status.Mode switch
- {
- VoiceActivationMode.WakeWord => "WakeWord",
- VoiceActivationMode.AlwaysOn => "AlwaysOn",
- _ => "Off"
- };
- }
-
- return "Off";
+ return VoiceDisplayHelper.GetRuntimeLabel(status);
}
private bool CanQuickToggleVoiceMode()
@@ -1670,9 +1655,12 @@ private void UpdateTrayIcon()
private void ShowSettings()
{
+ if (_settings == null || _voiceService == null)
+ return;
+
if (_settingsWindow == null || _settingsWindow.IsClosed)
{
- _settingsWindow = new SettingsWindow(_settings!);
+ _settingsWindow = new SettingsWindow(_settings, _voiceService);
_settingsWindow.Closed += (s, e) =>
{
_settingsWindow.SettingsSaved -= OnSettingsSaved;
@@ -1691,9 +1679,11 @@ private void ShowVoiceModeSettings()
if (_voiceModeWindow == null || _voiceModeWindow.IsClosed)
{
_voiceModeWindow = new VoiceModeWindow(_settings, _voiceService);
+ _voiceModeWindow.OpenSettingsRequested += (s, e) => ShowSettings();
_voiceModeWindow.Closed += (s, e) => _voiceModeWindow = null;
}
+ _voiceModeWindow.RefreshStatus();
_voiceModeWindow.Activate();
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs
new file mode 100644
index 0000000..3921863
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs
@@ -0,0 +1,50 @@
+using OpenClaw.Shared;
+
+namespace OpenClawTray.Services.Voice;
+
+public static class VoiceDisplayHelper
+{
+ public static string GetModeLabel(VoiceActivationMode mode)
+ {
+ return mode switch
+ {
+ VoiceActivationMode.WakeWord => "Voice Wake",
+ VoiceActivationMode.AlwaysOn => "Talk Mode",
+ _ => "Off"
+ };
+ }
+
+ public static string GetStateLabel(VoiceRuntimeState state)
+ {
+ return state switch
+ {
+ VoiceRuntimeState.Arming => "Starting",
+ VoiceRuntimeState.ListeningForWakeWord => "Listening",
+ VoiceRuntimeState.ListeningContinuously => "Listening",
+ VoiceRuntimeState.RecordingUtterance => "Recording",
+ VoiceRuntimeState.SubmittingAudio => "Sending",
+ VoiceRuntimeState.PendingManualSend => "Waiting for send",
+ VoiceRuntimeState.AwaitingResponse => "Waiting for reply",
+ VoiceRuntimeState.PlayingResponse => "Speaking",
+ VoiceRuntimeState.Paused => "Paused",
+ VoiceRuntimeState.Error => "Error",
+ VoiceRuntimeState.Idle => "Idle",
+ _ => "Stopped"
+ };
+ }
+
+ public static string GetRuntimeLabel(VoiceStatusInfo status)
+ {
+ if (status.State == VoiceRuntimeState.Paused)
+ {
+ return $"{GetModeLabel(status.Mode)} (Paused)";
+ }
+
+ if (status.Running)
+ {
+ return $"{GetModeLabel(status.Mode)} ({GetStateLabel(status.State)})";
+ }
+
+ return "Off";
+ }
+}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml
index f8631f5..75359d1 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml
+++ b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml
@@ -83,6 +83,59 @@
<!-- Test Notification -->
<Button x:Name="TestNotificationButton" x:Uid="SettingsTestNotificationButton" Content="Send Test Notification"
Click="OnTestNotification"/>
+
+ <StackPanel Spacing="8">
+ <TextBlock Text="VOICE" Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="#E74C3C" FontWeight="Bold"/>
+
+ <ComboBox x:Name="VoiceModeComboBox" Header="Mode" SelectionChanged="OnVoiceModeChanged">
+ <ComboBoxItem Content="Off" Tag="Off"/>
+ <ComboBoxItem Content="Voice Wake" Tag="WakeWord" IsEnabled="False"/>
+ <ComboBoxItem Content="Talk Mode" Tag="AlwaysOn"/>
+ </ComboBox>
+
+ <ComboBox x:Name="VoiceSpeechToTextProviderComboBox"
+ Header="Speech to text provider"
+ DisplayMemberPath="Name"
+ SelectionChanged="OnVoiceProviderChanged"/>
+ <ComboBox x:Name="VoiceTextToSpeechProviderComboBox"
+ Header="Text to speech provider"
+ DisplayMemberPath="Name"
+ SelectionChanged="OnVoiceProviderChanged"/>
+
+ <StackPanel x:Name="VoiceTtsProviderSettingsPanel" Spacing="8" Visibility="Collapsed">
+ <TextBlock x:Name="VoiceTtsProviderSettingsTitleTextBlock"
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="#E74C3C"
+ FontWeight="Bold"/>
+ <PasswordBox x:Name="VoiceTtsApiKeyPasswordBox"
+ Header="API key"
+ PasswordChanged="OnVoiceProviderSettingsChanged"/>
+ <TextBox x:Name="VoiceTtsModelTextBox"
+ Header="Model"
+ TextChanged="OnVoiceProviderSettingsChanged"/>
+ <TextBox x:Name="VoiceTtsVoiceIdTextBox"
+ Header="Voice ID"
+ TextChanged="OnVoiceProviderSettingsChanged"/>
+ </StackPanel>
+
+ <ComboBox x:Name="VoiceInputDeviceComboBox" Header="Listen device (microphone)" DisplayMemberPath="Name"/>
+ <ComboBox x:Name="VoiceOutputDeviceComboBox" Header="Talk device (speaker)" DisplayMemberPath="Name"/>
+ <Button Content="Refresh voice devices" HorizontalAlignment="Left" Click="OnRefreshVoiceDevices"/>
+
+ <CheckBox x:Name="VoiceConversationToastsCheckBox"
+ Content="Show voice transcripts and replies as toasts"/>
+
+ <ComboBox x:Name="VoiceChatWindowSubmitModeComboBox" Header="When the tray chat window is open">
+ <ComboBoxItem Content="Send automatically" Tag="AutoSend"/>
+ <ComboBoxItem Content="Fill message box and wait for me to send" Tag="WaitForUser"/>
+ </ComboBox>
+
+ <TextBlock x:Name="VoiceSettingsInfoTextBlock"
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ TextWrapping="Wrap"/>
+ </StackPanel>
<!-- Advanced Section -->
<StackPanel Spacing="8">
diff --git a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
index e4224a8..b08a262 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
@@ -1,9 +1,13 @@
using Microsoft.Toolkit.Uwp.Notifications;
using Microsoft.UI.Xaml;
+using Microsoft.UI.Xaml.Controls;
using OpenClaw.Shared;
using OpenClawTray.Helpers;
using OpenClawTray.Services;
+using OpenClawTray.Services.Voice;
using System;
+using System.Collections.Generic;
+using System.Linq;
using System.Threading.Tasks;
using WinUIEx;
@@ -12,26 +16,36 @@ namespace OpenClawTray.Windows;
public sealed partial class SettingsWindow : WindowEx
{
private readonly SettingsManager _settings;
+ private readonly VoiceService _voiceService;
+ private VoiceProviderCredentials _voiceProviderCredentialsDraft = new();
+ private string _activeTtsProviderId = VoiceProviderIds.Windows;
+ private bool _updatingVoiceProviderFields;
+ private List<ProviderOption> _speechToTextOptions = new();
+ private List<ProviderOption> _textToSpeechOptions = new();
+ private List<DeviceOption> _inputOptions = new();
+ private List<DeviceOption> _outputOptions = new();
+
public bool IsClosed { get; private set; }
public event EventHandler? SettingsSaved;
- public SettingsWindow(SettingsManager settings)
+ public SettingsWindow(SettingsManager settings, VoiceService voiceService)
{
_settings = settings;
+ _voiceService = voiceService;
InitializeComponent();
-
+
Title = LocalizationHelper.GetString("WindowTitle_Settings");
-
- // Window configuration
- this.SetWindowSize(480, 700);
+
+ this.SetWindowSize(560, 860);
this.CenterOnScreen();
this.SetIcon(IconHelper.GetStatusIconPath(ConnectionStatus.Connected));
-
+
LoadSettings();
-
+ _ = LoadVoiceDevicesAsync();
+
Closed += (s, e) => IsClosed = true;
-
+
Logger.Info("[Settings] Window opened");
}
@@ -42,11 +56,10 @@ private void LoadSettings()
AutoStartToggle.IsOn = _settings.AutoStart;
GlobalHotkeyToggle.IsOn = _settings.GlobalHotkeyEnabled;
NotificationsToggle.IsOn = _settings.ShowNotifications;
-
- // Set sound combo - match by Tag (stable persistence key), not Content (display text)
+
for (int i = 0; i < NotificationSoundComboBox.Items.Count; i++)
{
- if (NotificationSoundComboBox.Items[i] is Microsoft.UI.Xaml.Controls.ComboBoxItem item &&
+ if (NotificationSoundComboBox.Items[i] is ComboBoxItem item &&
item.Tag?.ToString() == _settings.NotificationSound)
{
NotificationSoundComboBox.SelectedIndex = i;
@@ -54,9 +67,10 @@ private void LoadSettings()
}
}
if (NotificationSoundComboBox.SelectedIndex < 0)
+ {
NotificationSoundComboBox.SelectedIndex = 0;
+ }
- // Notification filters
NotifyHealthCb.IsChecked = _settings.NotifyHealth;
NotifyUrgentCb.IsChecked = _settings.NotifyUrgent;
NotifyReminderCb.IsChecked = _settings.NotifyReminder;
@@ -65,9 +79,80 @@ private void LoadSettings()
NotifyBuildCb.IsChecked = _settings.NotifyBuild;
NotifyStockCb.IsChecked = _settings.NotifyStock;
NotifyInfoCb.IsChecked = _settings.NotifyInfo;
-
- // Advanced
+
NodeModeToggle.IsOn = _settings.EnableNodeMode;
+
+ LoadVoiceSettings();
+ }
+
+ private void LoadVoiceSettings()
+ {
+ _voiceProviderCredentialsDraft = Clone(_settings.VoiceProviderCredentials);
+ LoadVoiceProviders();
+ SelectVoiceMode(_settings.Voice.Mode);
+ SelectChatWindowSubmitMode(_settings.Voice.AlwaysOn.ChatWindowSubmitMode);
+ VoiceConversationToastsCheckBox.IsChecked = _settings.Voice.ShowConversationToasts;
+ UpdateVoiceProviderSettingsEditor();
+ UpdateVoiceSettingsInfo();
+ }
+
+ private void LoadVoiceProviders()
+ {
+ var catalog = _voiceService.GetProviderCatalog();
+
+ _speechToTextOptions = catalog.SpeechToTextProviders
+ .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
+ .ToList();
+ _textToSpeechOptions = catalog.TextToSpeechProviders
+ .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
+ .ToList();
+
+ VoiceSpeechToTextProviderComboBox.ItemsSource = _speechToTextOptions;
+ VoiceTextToSpeechProviderComboBox.ItemsSource = _textToSpeechOptions;
+
+ VoiceSpeechToTextProviderComboBox.SelectedItem =
+ _speechToTextOptions.FirstOrDefault(p => p.Id == _settings.Voice.SpeechToTextProviderId)
+ ?? _speechToTextOptions.FirstOrDefault();
+ VoiceTextToSpeechProviderComboBox.SelectedItem =
+ _textToSpeechOptions.FirstOrDefault(p => p.Id == _settings.Voice.TextToSpeechProviderId)
+ ?? _textToSpeechOptions.FirstOrDefault();
+ }
+
+ private async Task LoadVoiceDevicesAsync()
+ {
+ try
+ {
+ VoiceSettingsInfoTextBlock.Text = "Loading voice devices...";
+ var devices = await _voiceService.ListDevicesAsync();
+
+ _inputOptions =
+ [
+ new DeviceOption(null, "System default microphone")
+ ];
+ _inputOptions.AddRange(devices
+ .Where(d => d.IsInput)
+ .Select(d => new DeviceOption(d.DeviceId, d.Name)));
+
+ _outputOptions =
+ [
+ new DeviceOption(null, "System default speaker")
+ ];
+ _outputOptions.AddRange(devices
+ .Where(d => d.IsOutput)
+ .Select(d => new DeviceOption(d.DeviceId, d.Name)));
+
+ VoiceInputDeviceComboBox.ItemsSource = _inputOptions;
+ VoiceOutputDeviceComboBox.ItemsSource = _outputOptions;
+
+ VoiceInputDeviceComboBox.SelectedItem = _inputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.InputDeviceId) ?? _inputOptions[0];
+ VoiceOutputDeviceComboBox.SelectedItem = _outputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.OutputDeviceId) ?? _outputOptions[0];
+
+ UpdateVoiceSettingsInfo();
+ }
+ catch (Exception ex)
+ {
+ VoiceSettingsInfoTextBlock.Text = $"Failed to load voice devices: {ex.Message}";
+ }
}
private void SaveSettings()
@@ -77,8 +162,8 @@ private void SaveSettings()
_settings.AutoStart = AutoStartToggle.IsOn;
_settings.GlobalHotkeyEnabled = GlobalHotkeyToggle.IsOn;
_settings.ShowNotifications = NotificationsToggle.IsOn;
-
- if (NotificationSoundComboBox.SelectedItem is Microsoft.UI.Xaml.Controls.ComboBoxItem item)
+
+ if (NotificationSoundComboBox.SelectedItem is ComboBoxItem item)
{
_settings.NotificationSound = item.Tag?.ToString() ?? "Default";
}
@@ -91,14 +176,289 @@ private void SaveSettings()
_settings.NotifyBuild = NotifyBuildCb.IsChecked ?? true;
_settings.NotifyStock = NotifyStockCb.IsChecked ?? true;
_settings.NotifyInfo = NotifyInfoCb.IsChecked ?? true;
-
- // Advanced
_settings.EnableNodeMode = NodeModeToggle.IsOn;
+ CaptureSelectedVoiceProviderSettings();
+
+ _settings.Voice = new VoiceSettings
+ {
+ Mode = GetSelectedVoiceMode(),
+ Enabled = GetSelectedVoiceMode() != VoiceActivationMode.Off,
+ ShowConversationToasts = VoiceConversationToastsCheckBox.IsChecked ?? false,
+ SpeechToTextProviderId = (VoiceSpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
+ TextToSpeechProviderId = (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
+ InputDeviceId = (VoiceInputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
+ OutputDeviceId = (VoiceOutputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
+ SampleRateHz = _settings.Voice.SampleRateHz,
+ CaptureChunkMs = _settings.Voice.CaptureChunkMs,
+ BargeInEnabled = _settings.Voice.BargeInEnabled,
+ WakeWord = new VoiceWakeWordSettings
+ {
+ Engine = _settings.Voice.WakeWord.Engine,
+ ModelId = _settings.Voice.WakeWord.ModelId,
+ TriggerThreshold = _settings.Voice.WakeWord.TriggerThreshold,
+ TriggerCooldownMs = _settings.Voice.WakeWord.TriggerCooldownMs,
+ PreRollMs = _settings.Voice.WakeWord.PreRollMs,
+ EndSilenceMs = _settings.Voice.WakeWord.EndSilenceMs
+ },
+ AlwaysOn = new VoiceAlwaysOnSettings
+ {
+ MinSpeechMs = _settings.Voice.AlwaysOn.MinSpeechMs,
+ EndSilenceMs = _settings.Voice.AlwaysOn.EndSilenceMs,
+ MaxUtteranceMs = _settings.Voice.AlwaysOn.MaxUtteranceMs,
+ ChatWindowSubmitMode = GetSelectedChatWindowSubmitMode()
+ }
+ };
+ _settings.VoiceProviderCredentials = Clone(_voiceProviderCredentialsDraft);
+
_settings.Save();
AutoStartManager.SetAutoStart(_settings.AutoStart);
}
+ private void SelectVoiceMode(VoiceActivationMode mode)
+ {
+ var target = mode switch
+ {
+ VoiceActivationMode.WakeWord => "WakeWord",
+ VoiceActivationMode.AlwaysOn => "AlwaysOn",
+ _ => "Off"
+ };
+
+ foreach (var item in VoiceModeComboBox.Items.OfType<ComboBoxItem>())
+ {
+ if (string.Equals(item.Tag?.ToString(), target, StringComparison.Ordinal))
+ {
+ VoiceModeComboBox.SelectedItem = item;
+ return;
+ }
+ }
+
+ VoiceModeComboBox.SelectedIndex = 0;
+ }
+
+ private VoiceActivationMode GetSelectedVoiceMode()
+ {
+ var tag = (VoiceModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
+ return tag switch
+ {
+ "WakeWord" => VoiceActivationMode.WakeWord,
+ "AlwaysOn" => VoiceActivationMode.AlwaysOn,
+ _ => VoiceActivationMode.Off
+ };
+ }
+
+ private void SelectChatWindowSubmitMode(VoiceChatWindowSubmitMode mode)
+ {
+ var target = mode == VoiceChatWindowSubmitMode.WaitForUser ? "WaitForUser" : "AutoSend";
+
+ foreach (var item in VoiceChatWindowSubmitModeComboBox.Items.OfType<ComboBoxItem>())
+ {
+ if (string.Equals(item.Tag?.ToString(), target, StringComparison.Ordinal))
+ {
+ VoiceChatWindowSubmitModeComboBox.SelectedItem = item;
+ return;
+ }
+ }
+
+ VoiceChatWindowSubmitModeComboBox.SelectedIndex = 0;
+ }
+
+ private VoiceChatWindowSubmitMode GetSelectedChatWindowSubmitMode()
+ {
+ var tag = (VoiceChatWindowSubmitModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
+ return tag == "WaitForUser"
+ ? VoiceChatWindowSubmitMode.WaitForUser
+ : VoiceChatWindowSubmitMode.AutoSend;
+ }
+
+ private void UpdateVoiceSettingsInfo()
+ {
+ var stt = (VoiceSpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Recognition";
+ var tts = (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Synthesis";
+ var input = (VoiceInputDeviceComboBox.SelectedItem as DeviceOption)?.Name ?? "System default microphone";
+ var output = (VoiceOutputDeviceComboBox.SelectedItem as DeviceOption)?.Name ?? "System default speaker";
+ var fallbackNotice = string.Empty;
+
+ if (VoiceTextToSpeechProviderComboBox.SelectedItem is ProviderOption ttsOption &&
+ !VoiceProviderCatalogService.SupportsTextToSpeechRuntime(ttsOption.Id))
+ {
+ fallbackNotice = " Unsupported TTS providers will fall back to Windows until their runtime adapters are added.";
+ }
+
+ VoiceSettingsInfoTextBlock.Text =
+ $"Mode: {VoiceDisplayHelper.GetModeLabel(GetSelectedVoiceMode())}. STT: {stt}. TTS: {tts}. Listen: {input}. Talk: {output}.{fallbackNotice}";
+ }
+
+ private void UpdateVoiceProviderSettingsEditor()
+ {
+ var providerId = GetSelectedTextToSpeechProviderId();
+ var showProviderSettings = !string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase);
+
+ VoiceTtsProviderSettingsPanel.Visibility = showProviderSettings ? Visibility.Visible : Visibility.Collapsed;
+ if (!showProviderSettings)
+ {
+ _activeTtsProviderId = VoiceProviderIds.Windows;
+ return;
+ }
+
+ _updatingVoiceProviderFields = true;
+ try
+ {
+ VoiceTtsProviderSettingsTitleTextBlock.Text = $"{GetSelectedTextToSpeechProviderName().ToUpperInvariant()} SETTINGS";
+ VoiceTtsApiKeyPasswordBox.Password = GetProviderApiKey(providerId) ?? string.Empty;
+ VoiceTtsModelTextBox.Text = GetProviderModel(providerId);
+ VoiceTtsVoiceIdTextBox.Text = GetProviderVoiceId(providerId);
+ _activeTtsProviderId = providerId;
+ }
+ finally
+ {
+ _updatingVoiceProviderFields = false;
+ }
+ }
+
+ private string GetSelectedTextToSpeechProviderId()
+ {
+ return (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows;
+ }
+
+ private string GetSelectedTextToSpeechProviderName()
+ {
+ return (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Provider";
+ }
+
+ private void CaptureSelectedVoiceProviderSettings()
+ {
+ if (_updatingVoiceProviderFields)
+ {
+ return;
+ }
+
+ var providerId = _activeTtsProviderId;
+ if (string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase))
+ {
+ return;
+ }
+
+ SetProviderApiKey(providerId, VoiceTtsApiKeyPasswordBox.Password);
+ SetProviderModel(providerId, VoiceTtsModelTextBox.Text);
+ SetProviderVoiceId(providerId, VoiceTtsVoiceIdTextBox.Text);
+ }
+
+ private string? GetProviderApiKey(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxApiKey,
+ VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsApiKey,
+ _ => null
+ };
+ }
+
+ private string GetProviderModel(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxModel,
+ VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsModel ?? string.Empty,
+ _ => string.Empty
+ };
+ }
+
+ private string GetProviderVoiceId(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxVoiceId,
+ VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsVoiceId ?? string.Empty,
+ _ => string.Empty
+ };
+ }
+
+ private void SetProviderApiKey(string providerId, string? value)
+ {
+ var normalized = string.IsNullOrWhiteSpace(value) ? null : value.Trim();
+
+ switch (providerId)
+ {
+ case VoiceProviderIds.MiniMax:
+ _voiceProviderCredentialsDraft.MiniMaxApiKey = normalized;
+ break;
+ case VoiceProviderIds.ElevenLabs:
+ _voiceProviderCredentialsDraft.ElevenLabsApiKey = normalized;
+ break;
+ }
+ }
+
+ private void SetProviderModel(string providerId, string? value)
+ {
+ var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultModel(providerId) : value.Trim();
+
+ switch (providerId)
+ {
+ case VoiceProviderIds.MiniMax:
+ _voiceProviderCredentialsDraft.MiniMaxModel = normalized;
+ break;
+ case VoiceProviderIds.ElevenLabs:
+ _voiceProviderCredentialsDraft.ElevenLabsModel = normalized;
+ break;
+ }
+ }
+
+ private void SetProviderVoiceId(string providerId, string? value)
+ {
+ var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultVoiceId(providerId) : value.Trim();
+
+ switch (providerId)
+ {
+ case VoiceProviderIds.MiniMax:
+ _voiceProviderCredentialsDraft.MiniMaxVoiceId = normalized;
+ break;
+ case VoiceProviderIds.ElevenLabs:
+ _voiceProviderCredentialsDraft.ElevenLabsVoiceId = normalized;
+ break;
+ }
+ }
+
+ private static string GetDefaultModel(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => "speech-2.8-turbo",
+ _ => string.Empty
+ };
+ }
+
+ private static string GetDefaultVoiceId(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => "English_MatureBoss",
+ _ => string.Empty
+ };
+ }
+
+ private async void OnRefreshVoiceDevices(object sender, RoutedEventArgs e)
+ {
+ await LoadVoiceDevicesAsync();
+ }
+
+ private void OnVoiceModeChanged(object sender, SelectionChangedEventArgs e)
+ {
+ UpdateVoiceSettingsInfo();
+ }
+
+ private void OnVoiceProviderChanged(object sender, SelectionChangedEventArgs e)
+ {
+ CaptureSelectedVoiceProviderSettings();
+ UpdateVoiceProviderSettingsEditor();
+ UpdateVoiceSettingsInfo();
+ }
+
+ private void OnVoiceProviderSettingsChanged(object sender, RoutedEventArgs e)
+ {
+ CaptureSelectedVoiceProviderSettings();
+ }
+
private async void OnTestConnection(object sender, RoutedEventArgs e)
{
var gatewayUrl = GatewayUrlTextBox.Text.Trim();
@@ -122,7 +482,7 @@ private async void OnTestConnection(object sender, RoutedEventArgs e)
var connected = false;
var tcs = new TaskCompletionSource<bool>();
-
+
client.StatusChanged += (s, status) =>
{
if (status == ConnectionStatus.Connected)
@@ -137,8 +497,7 @@ private async void OnTestConnection(object sender, RoutedEventArgs e)
};
_ = client.ConnectAsync();
-
- // Wait up to 5 seconds for connection
+
var completedTask = await Task.WhenAny(tcs.Task, Task.Delay(5000));
if (completedTask != tcs.Task)
{
@@ -191,23 +550,28 @@ private void OnSave(object sender, RoutedEventArgs e)
var gatewayUrl = GatewayUrlTextBox.Text.Trim();
if (!GatewayUrlHelper.IsValidGatewayUrl(gatewayUrl))
{
- Logger.Warn($"[Settings] Save blocked — invalid gateway URL");
+ Logger.Warn("[Settings] Save blocked — invalid gateway URL");
StatusLabel.Text = $"❌ {GatewayUrlHelper.ValidationMessage}";
return;
}
- // Log key setting changes before saving
var oldGateway = _settings.GatewayUrl;
var oldAutoStart = _settings.AutoStart;
var oldNodeMode = _settings.EnableNodeMode;
SaveSettings();
if (!string.Equals(oldGateway, _settings.GatewayUrl, StringComparison.Ordinal))
- Logger.Info($"[Settings] GatewayUrl changed");
+ {
+ Logger.Info("[Settings] GatewayUrl changed");
+ }
if (oldAutoStart != _settings.AutoStart)
+ {
Logger.Info($"[Settings] AutoStart changed to {_settings.AutoStart}");
+ }
if (oldNodeMode != _settings.EnableNodeMode)
+ {
Logger.Info($"[Settings] NodeMode changed to {_settings.EnableNodeMode}");
+ }
Logger.Info("[Settings] Settings saved");
SettingsSaved?.Invoke(this, EventArgs.Empty);
@@ -220,6 +584,22 @@ private void OnCancel(object sender, RoutedEventArgs e)
Close();
}
+ private static VoiceProviderCredentials Clone(VoiceProviderCredentials source)
+ {
+ return new VoiceProviderCredentials
+ {
+ MiniMaxApiKey = source.MiniMaxApiKey,
+ MiniMaxModel = source.MiniMaxModel,
+ MiniMaxVoiceId = source.MiniMaxVoiceId,
+ ElevenLabsApiKey = source.ElevenLabsApiKey,
+ ElevenLabsModel = source.ElevenLabsModel,
+ ElevenLabsVoiceId = source.ElevenLabsVoiceId
+ };
+ }
+
+ private sealed record DeviceOption(string? DeviceId, string Name);
+ private sealed record ProviderOption(string Id, string Name, string Runtime, string? Description);
+
private class TestLogger : IOpenClawLogger
{
public string? LastError { get; private set; }
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
index c00f62d..90ed291 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml
@@ -22,80 +22,70 @@
<StackPanel Spacing="24" MaxWidth="500">
<StackPanel Spacing="8">
<TextBlock Text="VOICE MODE" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
- <TextBlock Text="Windows voice mode targets macOS parity: Always On maps to Talk Mode, and WakeWord maps to Voice Wake." TextWrapping="Wrap"/>
- </StackPanel>
-
- <StackPanel Spacing="8">
- <ComboBox x:Name="ModeComboBox" Header="Mode" SelectionChanged="OnModeChanged">
- <ComboBoxItem Content="Off" Tag="Off"/>
- <ComboBoxItem Content="WakeWord" Tag="WakeWord"/>
- <ComboBoxItem Content="AlwaysOn" Tag="AlwaysOn"/>
- </ComboBox>
-
- <InfoBar x:Name="ModeInfoBar" IsOpen="True" Severity="Informational" IsClosable="False"
- Title="Implementation status"
- Message="AlwaysOn is the current runtime target. WakeWord settings can be configured now, and NanoWakeWord activation will follow in a later step."/>
- </StackPanel>
-
- <StackPanel Spacing="8">
- <TextBlock Text="PROVIDERS" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
- <ComboBox x:Name="SpeechToTextProviderComboBox" Header="Speech to text provider" DisplayMemberPath="Name" SelectionChanged="OnProviderChanged"/>
- <ComboBox x:Name="TextToSpeechProviderComboBox" Header="Text to speech provider" DisplayMemberPath="Name" SelectionChanged="OnProviderChanged"/>
- <StackPanel x:Name="TtsProviderSettingsPanel" Spacing="8" Visibility="Collapsed">
- <TextBlock x:Name="TtsProviderSettingsTitleTextBlock"
- Style="{StaticResource CaptionTextBlockStyle}"
- Foreground="#E74C3C"
- FontWeight="Bold"/>
- <PasswordBox x:Name="TtsApiKeyPasswordBox"
- Header="API key"
- PasswordChanged="OnProviderSettingsChanged"/>
- <TextBox x:Name="TtsModelTextBox"
- Header="Model"
- TextChanged="OnProviderSettingsChanged"/>
- <TextBox x:Name="TtsVoiceIdTextBox"
- Header="Voice ID"
- TextChanged="OnProviderSettingsChanged"/>
- <TextBlock Text="These values are stored in your local tray settings file and are not read from the provider catalog."
- Style="{StaticResource CaptionTextBlockStyle}"
- Foreground="{ThemeResource TextFillColorSecondaryBrush}"
- TextWrapping="Wrap"/>
- </StackPanel>
- <TextBlock x:Name="ProviderInfoTextBlock"
- Style="{StaticResource CaptionTextBlockStyle}"
- Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ <TextBlock Text="Current voice runtime status and configuration summary."
TextWrapping="Wrap"/>
</StackPanel>
<StackPanel Spacing="8">
- <TextBlock Text="DEVICES" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
- <ComboBox x:Name="InputDeviceComboBox" Header="Listen device (microphone)" DisplayMemberPath="Name"/>
- <ComboBox x:Name="OutputDeviceComboBox" Header="Talk device (speaker)" DisplayMemberPath="Name"/>
- <Button Content="Refresh devices" HorizontalAlignment="Left" Click="OnRefreshDevices"/>
+ <TextBlock Text="STATUS" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
+ <ItemsControl x:Name="StatusItemsControl">
+ <ItemsControl.ItemTemplate>
+ <DataTemplate>
+ <Grid Padding="8" Margin="0,2" CornerRadius="4"
+ Background="{ThemeResource CardBackgroundFillColorDefaultBrush}">
+ <Grid.ColumnDefinitions>
+ <ColumnDefinition Width="150"/>
+ <ColumnDefinition Width="*"/>
+ </Grid.ColumnDefinitions>
+ <TextBlock Text="{Binding Label}" Grid.Column="0"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"/>
+ <TextBlock Text="{Binding Value}" Grid.Column="1" TextWrapping="Wrap"/>
+ </Grid>
+ </DataTemplate>
+ </ItemsControl.ItemTemplate>
+ </ItemsControl>
</StackPanel>
<StackPanel Spacing="8">
- <TextBlock Text="NOTIFICATIONS" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
- <CheckBox x:Name="VoiceConversationToastsCheckBox"
- Content="Show voice transcripts and replies as toasts"/>
+ <TextBlock Text="CONFIGURATION" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
+ <ItemsControl x:Name="ConfigurationItemsControl">
+ <ItemsControl.ItemTemplate>
+ <DataTemplate>
+ <Grid Padding="8" Margin="0,2" CornerRadius="4"
+ Background="{ThemeResource CardBackgroundFillColorDefaultBrush}">
+ <Grid.ColumnDefinitions>
+ <ColumnDefinition Width="150"/>
+ <ColumnDefinition Width="*"/>
+ </Grid.ColumnDefinitions>
+ <TextBlock Text="{Binding Label}" Grid.Column="0"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"/>
+ <TextBlock Text="{Binding Value}" Grid.Column="1" TextWrapping="Wrap"/>
+ </Grid>
+ </DataTemplate>
+ </ItemsControl.ItemTemplate>
+ </ItemsControl>
</StackPanel>
<StackPanel Spacing="8">
- <TextBlock Text="CHAT WINDOW" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
- <ComboBox x:Name="ChatWindowSubmitModeComboBox" Header="When the tray chat window is open">
- <ComboBoxItem Content="Send automatically" Tag="AutoSend"/>
- <ComboBoxItem Content="Fill message box and wait for me to send" Tag="WaitForUser"/>
- </ComboBox>
- <TextBlock Text="This setting only applies while the tray chat window is open. In windowless voice mode, utterances are always sent automatically."
- Style="{StaticResource CaptionTextBlockStyle}"
- Foreground="{ThemeResource TextFillColorSecondaryBrush}"
- TextWrapping="Wrap"/>
+ <TextBlock Text="RECENT" Style="{StaticResource CaptionTextBlockStyle}" Foreground="#E74C3C" FontWeight="Bold"/>
+ <ItemsControl x:Name="RecentItemsControl">
+ <ItemsControl.ItemTemplate>
+ <DataTemplate>
+ <Grid Padding="8" Margin="0,2" CornerRadius="4"
+ Background="{ThemeResource CardBackgroundFillColorDefaultBrush}">
+ <Grid.ColumnDefinitions>
+ <ColumnDefinition Width="150"/>
+ <ColumnDefinition Width="*"/>
+ </Grid.ColumnDefinitions>
+ <TextBlock Text="{Binding Label}" Grid.Column="0"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"/>
+ <TextBlock Text="{Binding Value}" Grid.Column="1" TextWrapping="Wrap"/>
+ </Grid>
+ </DataTemplate>
+ </ItemsControl.ItemTemplate>
+ </ItemsControl>
</StackPanel>
- <TextBlock x:Name="StatusTextBlock"
- Style="{StaticResource CaptionTextBlockStyle}"
- Foreground="{ThemeResource TextFillColorSecondaryBrush}"
- TextWrapping="Wrap"/>
-
<StackPanel x:Name="TroubleshootingPanel" Spacing="8" Visibility="Collapsed">
<TextBlock x:Name="TroubleshootingTextBlock"
Style="{StaticResource CaptionTextBlockStyle}"
@@ -121,8 +111,9 @@
BorderThickness="0,1,0,0"
Padding="24,16">
<StackPanel Orientation="Horizontal" Spacing="8" HorizontalAlignment="Right">
- <Button Content="Cancel" Click="OnCancel" Width="80"/>
- <Button Content="Save" Click="OnSave" Width="80" Style="{ThemeResource AccentButtonStyle}"/>
+ <Button Content="Refresh" Click="OnRefresh" Width="90"/>
+ <Button Content="Settings" Click="OnOpenSettings" Width="90"/>
+ <Button Content="Close" Click="OnClose" Width="90" Style="{ThemeResource AccentButtonStyle}"/>
</StackPanel>
</Border>
</Grid>
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
index dc58d27..062f261 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
@@ -1,5 +1,4 @@
using Microsoft.UI.Xaml;
-using Microsoft.UI.Xaml.Controls;
using OpenClaw.Shared;
using OpenClawTray.Helpers;
using OpenClawTray.Services;
@@ -7,8 +6,6 @@
using System;
using System.Collections.Generic;
using System.Diagnostics;
-using System.Linq;
-using System.Threading.Tasks;
using WinUIEx;
namespace OpenClawTray.Windows;
@@ -17,16 +14,11 @@ public sealed partial class VoiceModeWindow : WindowEx
{
private readonly SettingsManager _settings;
private readonly VoiceService _voiceService;
- private VoiceProviderCredentials _providerCredentialsDraft = new();
- private string _activeTtsProviderId = VoiceProviderIds.Windows;
- private bool _updatingProviderFields;
- private List<ProviderOption> _speechToTextOptions = new();
- private List<ProviderOption> _textToSpeechOptions = new();
- private List<DeviceOption> _inputOptions = new();
- private List<DeviceOption> _outputOptions = new();
public bool IsClosed { get; private set; }
+ public event EventHandler? OpenSettingsRequested;
+
public VoiceModeWindow(SettingsManager settings, VoiceService voiceService)
{
_settings = settings;
@@ -41,351 +33,74 @@ public VoiceModeWindow(SettingsManager settings, VoiceService voiceService)
Closed += (s, e) => IsClosed = true;
- LoadSettings();
- _ = LoadDevicesAsync();
- }
-
- private void LoadSettings()
- {
- _providerCredentialsDraft = Clone(_settings.VoiceProviderCredentials);
- LoadProviders();
- SelectMode(_settings.Voice.Mode);
- SelectChatWindowSubmitMode(_settings.Voice.AlwaysOn.ChatWindowSubmitMode);
- VoiceConversationToastsCheckBox.IsChecked = _settings.Voice.ShowConversationToasts;
- UpdateModeInfo();
- UpdateProviderSettingsEditor();
- UpdateProviderInfo();
- StatusTextBlock.Text = BuildStatusText();
- }
-
- private void LoadProviders()
- {
- var catalog = _voiceService.GetProviderCatalog();
-
- _speechToTextOptions = catalog.SpeechToTextProviders
- .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
- .ToList();
- _textToSpeechOptions = catalog.TextToSpeechProviders
- .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
- .ToList();
-
- SpeechToTextProviderComboBox.ItemsSource = _speechToTextOptions;
- TextToSpeechProviderComboBox.ItemsSource = _textToSpeechOptions;
-
- SpeechToTextProviderComboBox.SelectedItem =
- _speechToTextOptions.FirstOrDefault(p => p.Id == _settings.Voice.SpeechToTextProviderId)
- ?? _speechToTextOptions.FirstOrDefault();
- TextToSpeechProviderComboBox.SelectedItem =
- _textToSpeechOptions.FirstOrDefault(p => p.Id == _settings.Voice.TextToSpeechProviderId)
- ?? _textToSpeechOptions.FirstOrDefault();
- }
-
- private async Task LoadDevicesAsync()
- {
- try
- {
- StatusTextBlock.Text = "Loading audio devices...";
- var devices = await _voiceService.ListDevicesAsync();
-
- _inputOptions =
- [
- new DeviceOption(null, "System default microphone")
- ];
- _inputOptions.AddRange(devices
- .Where(d => d.IsInput)
- .Select(d => new DeviceOption(d.DeviceId, d.Name)));
-
- _outputOptions =
- [
- new DeviceOption(null, "System default speaker")
- ];
- _outputOptions.AddRange(devices
- .Where(d => d.IsOutput)
- .Select(d => new DeviceOption(d.DeviceId, d.Name)));
-
- InputDeviceComboBox.ItemsSource = _inputOptions;
- OutputDeviceComboBox.ItemsSource = _outputOptions;
-
- InputDeviceComboBox.SelectedItem = _inputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.InputDeviceId) ?? _inputOptions[0];
- OutputDeviceComboBox.SelectedItem = _outputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.OutputDeviceId) ?? _outputOptions[0];
-
- StatusTextBlock.Text = BuildStatusText();
- }
- catch (Exception ex)
- {
- StatusTextBlock.Text = $"Failed to load devices: {ex.Message}";
- }
- }
-
- private void SelectMode(VoiceActivationMode mode)
- {
- var target = mode switch
- {
- VoiceActivationMode.WakeWord => "WakeWord",
- VoiceActivationMode.AlwaysOn => "AlwaysOn",
- _ => "Off"
- };
-
- foreach (var item in ModeComboBox.Items.OfType<ComboBoxItem>())
- {
- if (string.Equals(item.Tag?.ToString(), target, StringComparison.Ordinal))
- {
- ModeComboBox.SelectedItem = item;
- return;
- }
- }
-
- ModeComboBox.SelectedIndex = 0;
- }
-
- private VoiceActivationMode GetSelectedMode()
- {
- var tag = (ModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
- return tag switch
- {
- "WakeWord" => VoiceActivationMode.WakeWord,
- "AlwaysOn" => VoiceActivationMode.AlwaysOn,
- _ => VoiceActivationMode.Off
- };
- }
-
- private void SelectChatWindowSubmitMode(VoiceChatWindowSubmitMode mode)
- {
- var target = mode == VoiceChatWindowSubmitMode.WaitForUser ? "WaitForUser" : "AutoSend";
-
- foreach (var item in ChatWindowSubmitModeComboBox.Items.OfType<ComboBoxItem>())
- {
- if (string.Equals(item.Tag?.ToString(), target, StringComparison.Ordinal))
- {
- ChatWindowSubmitModeComboBox.SelectedItem = item;
- return;
- }
- }
-
- ChatWindowSubmitModeComboBox.SelectedIndex = 0;
- }
-
- private VoiceChatWindowSubmitMode GetSelectedChatWindowSubmitMode()
- {
- var tag = (ChatWindowSubmitModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
- return tag == "WaitForUser"
- ? VoiceChatWindowSubmitMode.WaitForUser
- : VoiceChatWindowSubmitMode.AutoSend;
- }
-
- private string BuildStatusText()
- {
- var running = _voiceService.CurrentStatus;
- var runtime = running.State == VoiceRuntimeState.Paused
- ? $"{running.Mode} ({running.State})"
- : running.Running
- ? $"{running.Mode} ({running.State})"
- : "Off";
- var nodeMode = _settings.EnableNodeMode ? "enabled" : "disabled";
- var stt = (SpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Recognition";
- var tts = (TextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Synthesis";
- var error = string.IsNullOrWhiteSpace(running.LastError)
- ? string.Empty
- : $" Last issue: {running.LastError}.";
- UpdateTroubleshooting(running.LastError);
- return $"Runtime: {runtime}. Node Mode is {nodeMode}. STT: {stt}. TTS: {tts}.{error}";
+ RefreshStatus();
}
public void RefreshStatus()
{
- StatusTextBlock.Text = BuildStatusText();
- }
+ var running = _voiceService.CurrentStatus;
+ var catalog = _voiceService.GetProviderCatalog();
- private void UpdateModeInfo()
- {
- var mode = GetSelectedMode();
- ModeInfoBar.Message = mode switch
+ StatusItemsControl.ItemsSource = new List<DetailRow>
{
- VoiceActivationMode.WakeWord => "WakeWord settings are saved now, but NanoWakeWord activation is still the next implementation step.",
- VoiceActivationMode.AlwaysOn => "AlwaysOn is the first active runtime target. It uses Windows speech recognition and turn-based reply playback today.",
- _ => "Voice runtime stays off until you choose a listening mode."
+ new("Mode", VoiceDisplayHelper.GetModeLabel(_settings.Voice.Mode)),
+ new("Runtime", VoiceDisplayHelper.GetRuntimeLabel(running)),
+ new("Node Mode", _settings.EnableNodeMode ? "Enabled" : "Disabled"),
+ new("Session", string.IsNullOrWhiteSpace(running.SessionKey) ? "main" : running.SessionKey!),
+ new("State", VoiceDisplayHelper.GetStateLabel(running.State))
};
- }
-
- private void UpdateProviderInfo()
- {
- var stt = SpeechToTextProviderComboBox.SelectedItem as ProviderOption;
- var tts = TextToSpeechProviderComboBox.SelectedItem as ProviderOption;
-
- var details = new List<string>();
- if (stt != null)
- {
- details.Add($"STT: {stt.Name}");
- }
-
- if (tts != null)
- {
- details.Add($"TTS: {tts.Name}");
- }
-
- var sttFallback = stt != null &&
- !VoiceProviderCatalogService.SupportsWindowsRuntime(stt.Id);
- var ttsFallback = tts != null &&
- !VoiceProviderCatalogService.SupportsTextToSpeechRuntime(tts.Id);
-
- var fallbackNotice = sttFallback || ttsFallback
- ? " Unsupported provider selections fall back to Windows until their runtime adapters are added."
- : string.Empty;
- var credentialNotice = tts != null &&
- !string.Equals(tts.Id, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase)
- ? " Configure the selected provider below; values are stored in your local tray settings."
- : string.Empty;
-
- ProviderInfoTextBlock.Text =
- $"{string.Join(" · ", details)}. Configure extra providers in {VoiceProviderCatalogService.CatalogFilePath}.{credentialNotice}{fallbackNotice}";
- }
-
- private void UpdateProviderSettingsEditor()
- {
- var providerId = GetSelectedTextToSpeechProviderId();
- var showProviderSettings = !string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase);
-
- TtsProviderSettingsPanel.Visibility = showProviderSettings ? Visibility.Visible : Visibility.Collapsed;
- if (!showProviderSettings)
- {
- _activeTtsProviderId = VoiceProviderIds.Windows;
- return;
- }
-
- _updatingProviderFields = true;
- try
- {
- TtsProviderSettingsTitleTextBlock.Text = $"{GetSelectedTextToSpeechProviderName().ToUpperInvariant()} SETTINGS";
- TtsApiKeyPasswordBox.Password = GetProviderApiKey(providerId) ?? string.Empty;
- TtsModelTextBox.Text = GetProviderModel(providerId);
- TtsVoiceIdTextBox.Text = GetProviderVoiceId(providerId);
- _activeTtsProviderId = providerId;
- }
- finally
- {
- _updatingProviderFields = false;
- }
- }
- private string GetSelectedTextToSpeechProviderId()
- {
- return (TextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows;
- }
-
- private string GetSelectedTextToSpeechProviderName()
- {
- return (TextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Provider";
- }
-
- private void CaptureSelectedProviderSettings()
- {
- if (_updatingProviderFields)
- {
- return;
- }
-
- var providerId = _activeTtsProviderId;
- if (string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase))
+ ConfigurationItemsControl.ItemsSource = new List<DetailRow>
{
- return;
- }
-
- SetProviderApiKey(providerId, TtsApiKeyPasswordBox.Password);
- SetProviderModel(providerId, TtsModelTextBox.Text);
- SetProviderVoiceId(providerId, TtsVoiceIdTextBox.Text);
- }
-
- private string? GetProviderApiKey(string providerId)
- {
- return providerId switch
- {
- VoiceProviderIds.MiniMax => _providerCredentialsDraft.MiniMaxApiKey,
- VoiceProviderIds.ElevenLabs => _providerCredentialsDraft.ElevenLabsApiKey,
- _ => null
+ new("Speech to text", ResolveProviderName(catalog.SpeechToTextProviders, _settings.Voice.SpeechToTextProviderId, "Windows Speech Recognition")),
+ new("Text to speech", ResolveProviderName(catalog.TextToSpeechProviders, _settings.Voice.TextToSpeechProviderId, "Windows Speech Synthesis")),
+ new("Listen device", DescribeDevice(_settings.Voice.InputDeviceId, "System default microphone")),
+ new("Talk device", DescribeDevice(_settings.Voice.OutputDeviceId, "System default speaker")),
+ new("Chat window", DescribeChatWindowSubmitMode(_settings.Voice.AlwaysOn.ChatWindowSubmitMode)),
+ new("Voice toasts", _settings.Voice.ShowConversationToasts ? "Enabled" : "Disabled")
};
- }
- private string GetProviderModel(string providerId)
- {
- return providerId switch
+ RecentItemsControl.ItemsSource = new List<DetailRow>
{
- VoiceProviderIds.MiniMax => _providerCredentialsDraft.MiniMaxModel,
- VoiceProviderIds.ElevenLabs => _providerCredentialsDraft.ElevenLabsModel ?? string.Empty,
- _ => string.Empty
+ new("Last utterance", FormatTimestamp(running.LastUtteranceUtc)),
+ new("Last wake", FormatTimestamp(running.LastWakeWordUtc)),
+ new("Last issue", string.IsNullOrWhiteSpace(running.LastError) ? "None" : running.LastError!)
};
- }
- private string GetProviderVoiceId(string providerId)
- {
- return providerId switch
- {
- VoiceProviderIds.MiniMax => _providerCredentialsDraft.MiniMaxVoiceId,
- VoiceProviderIds.ElevenLabs => _providerCredentialsDraft.ElevenLabsVoiceId ?? string.Empty,
- _ => string.Empty
- };
+ UpdateTroubleshooting(running.LastError);
}
- private void SetProviderApiKey(string providerId, string? value)
+ private static string ResolveProviderName(
+ IReadOnlyList<VoiceProviderOption> providers,
+ string? providerId,
+ string fallback)
{
- var normalized = string.IsNullOrWhiteSpace(value) ? null : value.Trim();
-
- switch (providerId)
+ foreach (var provider in providers)
{
- case VoiceProviderIds.MiniMax:
- _providerCredentialsDraft.MiniMaxApiKey = normalized;
- break;
- case VoiceProviderIds.ElevenLabs:
- _providerCredentialsDraft.ElevenLabsApiKey = normalized;
- break;
+ if (string.Equals(provider.Id, providerId, StringComparison.OrdinalIgnoreCase))
+ {
+ return provider.Name;
+ }
}
- }
-
- private void SetProviderModel(string providerId, string? value)
- {
- var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultModel(providerId) : value.Trim();
- switch (providerId)
- {
- case VoiceProviderIds.MiniMax:
- _providerCredentialsDraft.MiniMaxModel = normalized;
- break;
- case VoiceProviderIds.ElevenLabs:
- _providerCredentialsDraft.ElevenLabsModel = normalized;
- break;
- }
+ return fallback;
}
- private void SetProviderVoiceId(string providerId, string? value)
+ private static string DescribeDevice(string? deviceId, string defaultLabel)
{
- var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultVoiceId(providerId) : value.Trim();
-
- switch (providerId)
- {
- case VoiceProviderIds.MiniMax:
- _providerCredentialsDraft.MiniMaxVoiceId = normalized;
- break;
- case VoiceProviderIds.ElevenLabs:
- _providerCredentialsDraft.ElevenLabsVoiceId = normalized;
- break;
- }
+ return string.IsNullOrWhiteSpace(deviceId) ? defaultLabel : "Selected device";
}
- private static string GetDefaultModel(string providerId)
+ private static string DescribeChatWindowSubmitMode(VoiceChatWindowSubmitMode mode)
{
- return providerId switch
- {
- VoiceProviderIds.MiniMax => "speech-2.8-turbo",
- _ => string.Empty
- };
+ return mode == VoiceChatWindowSubmitMode.WaitForUser
+ ? "Fill message box and wait for send"
+ : "Send automatically";
}
- private static string GetDefaultVoiceId(string providerId)
+ private static string FormatTimestamp(DateTime? value)
{
- return providerId switch
- {
- VoiceProviderIds.MiniMax => "English_MatureBoss",
- _ => string.Empty
- };
+ return value?.ToLocalTime().ToString("HH:mm:ss") ?? "None";
}
private void UpdateTroubleshooting(string? error)
@@ -405,7 +120,7 @@ private void UpdateTroubleshooting(string? error)
TroubleshootingPanel.Visibility = Visibility.Visible;
OpenSpeechSettingsButton.Visibility = Visibility.Visible;
TroubleshootingTextBlock.Text =
- "To fix this: open Windows Settings, go to Privacy & security > Speech, turn on Online speech recognition, then restart Voice Mode.";
+ "To fix this: open Windows Settings, go to Privacy & security > Speech, turn on Online speech recognition, then restart voice mode.";
return;
}
@@ -414,34 +129,10 @@ private void UpdateTroubleshooting(string? error)
TroubleshootingPanel.Visibility = Visibility.Visible;
OpenMicrophoneSettingsButton.Visibility = Visibility.Visible;
TroubleshootingTextBlock.Text =
- "To fix this: open Windows Settings, go to Privacy & security > Microphone, allow microphone access and enable desktop app access, then restart Voice Mode.";
+ "To fix this: open Windows Settings, go to Privacy & security > Microphone, allow microphone access and enable desktop app access, then restart voice mode.";
}
}
- private async void OnRefreshDevices(object sender, RoutedEventArgs e)
- {
- await LoadDevicesAsync();
- }
-
- private void OnModeChanged(object sender, SelectionChangedEventArgs e)
- {
- UpdateModeInfo();
- StatusTextBlock.Text = BuildStatusText();
- }
-
- private void OnProviderChanged(object sender, SelectionChangedEventArgs e)
- {
- CaptureSelectedProviderSettings();
- UpdateProviderSettingsEditor();
- UpdateProviderInfo();
- StatusTextBlock.Text = BuildStatusText();
- }
-
- private void OnProviderSettingsChanged(object sender, RoutedEventArgs e)
- {
- CaptureSelectedProviderSettings();
- }
-
private void OnOpenSpeechSettings(object sender, RoutedEventArgs e)
{
OpenSettingsUri("ms-settings:privacy-speech");
@@ -452,70 +143,17 @@ private void OnOpenMicrophoneSettings(object sender, RoutedEventArgs e)
OpenSettingsUri("ms-settings:privacy-microphone");
}
- private async void OnSave(object sender, RoutedEventArgs e)
+ private void OnRefresh(object sender, RoutedEventArgs e)
{
- CaptureSelectedProviderSettings();
-
- var updated = new VoiceSettings
- {
- Mode = GetSelectedMode(),
- Enabled = GetSelectedMode() != VoiceActivationMode.Off,
- ShowConversationToasts = VoiceConversationToastsCheckBox.IsChecked ?? false,
- SpeechToTextProviderId = (SpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
- TextToSpeechProviderId = (TextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
- InputDeviceId = (InputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
- OutputDeviceId = (OutputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
- SampleRateHz = _settings.Voice.SampleRateHz,
- CaptureChunkMs = _settings.Voice.CaptureChunkMs,
- BargeInEnabled = _settings.Voice.BargeInEnabled,
- WakeWord = new VoiceWakeWordSettings
- {
- Engine = _settings.Voice.WakeWord.Engine,
- ModelId = _settings.Voice.WakeWord.ModelId,
- TriggerThreshold = _settings.Voice.WakeWord.TriggerThreshold,
- TriggerCooldownMs = _settings.Voice.WakeWord.TriggerCooldownMs,
- PreRollMs = _settings.Voice.WakeWord.PreRollMs,
- EndSilenceMs = _settings.Voice.WakeWord.EndSilenceMs
- },
- AlwaysOn = new VoiceAlwaysOnSettings
- {
- MinSpeechMs = _settings.Voice.AlwaysOn.MinSpeechMs,
- EndSilenceMs = _settings.Voice.AlwaysOn.EndSilenceMs,
- MaxUtteranceMs = _settings.Voice.AlwaysOn.MaxUtteranceMs,
- ChatWindowSubmitMode = GetSelectedChatWindowSubmitMode()
- }
- };
-
- try
- {
- _settings.VoiceProviderCredentials = Clone(_providerCredentialsDraft);
- await _voiceService.UpdateSettingsAsync(new VoiceSettingsUpdateArgs
- {
- Settings = updated,
- Persist = true
- });
-
- if (_settings.EnableNodeMode)
- {
- if (updated.Mode == VoiceActivationMode.Off)
- {
- await _voiceService.StopAsync(new VoiceStopArgs { Reason = "Voice mode disabled by user" });
- }
- else
- {
- await _voiceService.StartAsync(new VoiceStartArgs { Mode = updated.Mode });
- }
- }
+ RefreshStatus();
+ }
- Close();
- }
- catch (Exception ex)
- {
- StatusTextBlock.Text = $"Failed to save voice settings: {ex.Message}";
- }
+ private void OnOpenSettings(object sender, RoutedEventArgs e)
+ {
+ OpenSettingsRequested?.Invoke(this, EventArgs.Empty);
}
- private void OnCancel(object sender, RoutedEventArgs e)
+ private void OnClose(object sender, RoutedEventArgs e)
{
Close();
}
@@ -531,19 +169,5 @@ private static void OpenSettingsUri(string uri)
}
}
- private static VoiceProviderCredentials Clone(VoiceProviderCredentials source)
- {
- return new VoiceProviderCredentials
- {
- MiniMaxApiKey = source.MiniMaxApiKey,
- MiniMaxModel = source.MiniMaxModel,
- MiniMaxVoiceId = source.MiniMaxVoiceId,
- ElevenLabsApiKey = source.ElevenLabsApiKey,
- ElevenLabsModel = source.ElevenLabsModel,
- ElevenLabsVoiceId = source.ElevenLabsVoiceId
- };
- }
-
- private sealed record DeviceOption(string? DeviceId, string Name);
- private sealed record ProviderOption(string Id, string Name, string Runtime, string? Description);
+ private sealed record DetailRow(string Label, string Value);
}
From 6dba89bbf80d6c3c1426112fad9df3eeaaf6ab6d Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 13:09:56 +0000
Subject: [PATCH 16/83] Extract hosted voice settings panel from settings
window
---
.../Controls/VoiceSettingsPanel.xaml | 59 +++
.../Controls/VoiceSettingsPanel.xaml.cs | 414 ++++++++++++++++++
.../Windows/SettingsWindow.xaml | 54 +--
.../Windows/SettingsWindow.xaml.cs | 378 +---------------
4 files changed, 477 insertions(+), 428 deletions(-)
create mode 100644 src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
create mode 100644 src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
new file mode 100644
index 0000000..181a816
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
@@ -0,0 +1,59 @@
+<?xml version="1.0" encoding="utf-8"?>
+<UserControl
+ x:Class="OpenClawTray.Controls.VoiceSettingsPanel"
+ xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
+ xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">
+
+ <StackPanel Spacing="8">
+ <TextBlock Text="VOICE" Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="#E74C3C" FontWeight="Bold"/>
+
+ <ComboBox x:Name="VoiceModeComboBox" Header="Mode" SelectionChanged="OnVoiceModeChanged">
+ <ComboBoxItem Content="Off" Tag="Off"/>
+ <ComboBoxItem Content="Voice Wake" Tag="WakeWord" IsEnabled="False"/>
+ <ComboBoxItem Content="Talk Mode" Tag="AlwaysOn"/>
+ </ComboBox>
+
+ <ComboBox x:Name="VoiceSpeechToTextProviderComboBox"
+ Header="Speech to text provider"
+ DisplayMemberPath="Name"
+ SelectionChanged="OnVoiceProviderChanged"/>
+ <ComboBox x:Name="VoiceTextToSpeechProviderComboBox"
+ Header="Text to speech provider"
+ DisplayMemberPath="Name"
+ SelectionChanged="OnVoiceProviderChanged"/>
+
+ <StackPanel x:Name="VoiceTtsProviderSettingsPanel" Spacing="8" Visibility="Collapsed">
+ <TextBlock x:Name="VoiceTtsProviderSettingsTitleTextBlock"
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="#E74C3C"
+ FontWeight="Bold"/>
+ <PasswordBox x:Name="VoiceTtsApiKeyPasswordBox"
+ Header="API key"
+ PasswordChanged="OnVoiceProviderSettingsChanged"/>
+ <TextBox x:Name="VoiceTtsModelTextBox"
+ Header="Model"
+ TextChanged="OnVoiceProviderSettingsChanged"/>
+ <TextBox x:Name="VoiceTtsVoiceIdTextBox"
+ Header="Voice ID"
+ TextChanged="OnVoiceProviderSettingsChanged"/>
+ </StackPanel>
+
+ <ComboBox x:Name="VoiceInputDeviceComboBox" Header="Listen device (microphone)" DisplayMemberPath="Name"/>
+ <ComboBox x:Name="VoiceOutputDeviceComboBox" Header="Talk device (speaker)" DisplayMemberPath="Name"/>
+ <Button Content="Refresh voice devices" HorizontalAlignment="Left" Click="OnRefreshVoiceDevices"/>
+
+ <CheckBox x:Name="VoiceConversationToastsCheckBox"
+ Content="Show voice transcripts and replies as toasts"/>
+
+ <ComboBox x:Name="VoiceChatWindowSubmitModeComboBox" Header="When the tray chat window is open">
+ <ComboBoxItem Content="Send automatically" Tag="AutoSend"/>
+ <ComboBoxItem Content="Fill message box and wait for me to send" Tag="WaitForUser"/>
+ </ComboBox>
+
+ <TextBlock x:Name="VoiceSettingsInfoTextBlock"
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ TextWrapping="Wrap"/>
+ </StackPanel>
+</UserControl>
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
new file mode 100644
index 0000000..5c658e8
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -0,0 +1,414 @@
+using Microsoft.UI.Xaml;
+using Microsoft.UI.Xaml.Controls;
+using OpenClaw.Shared;
+using OpenClawTray.Services;
+using OpenClawTray.Services.Voice;
+using System;
+using System.Collections.Generic;
+using System.Linq;
+using System.Threading.Tasks;
+
+namespace OpenClawTray.Controls;
+
+public sealed partial class VoiceSettingsPanel : UserControl
+{
+ private SettingsManager? _settings;
+ private VoiceService? _voiceService;
+ private VoiceProviderCredentials _voiceProviderCredentialsDraft = new();
+ private string _activeTtsProviderId = VoiceProviderIds.Windows;
+ private bool _updatingVoiceProviderFields;
+ private List<ProviderOption> _speechToTextOptions = new();
+ private List<ProviderOption> _textToSpeechOptions = new();
+ private List<DeviceOption> _inputOptions = new();
+ private List<DeviceOption> _outputOptions = new();
+
+ public VoiceSettingsPanel()
+ {
+ InitializeComponent();
+ }
+
+ public void Initialize(SettingsManager settings, VoiceService voiceService)
+ {
+ _settings = settings;
+ _voiceService = voiceService;
+
+ LoadVoiceSettings();
+ _ = LoadVoiceDevicesAsync();
+ }
+
+ public void ApplyTo(SettingsManager settings)
+ {
+ CaptureSelectedVoiceProviderSettings();
+
+ settings.Voice = new VoiceSettings
+ {
+ Mode = GetSelectedVoiceMode(),
+ Enabled = GetSelectedVoiceMode() != VoiceActivationMode.Off,
+ ShowConversationToasts = VoiceConversationToastsCheckBox.IsChecked ?? false,
+ SpeechToTextProviderId = (VoiceSpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
+ TextToSpeechProviderId = (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
+ InputDeviceId = (VoiceInputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
+ OutputDeviceId = (VoiceOutputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
+ SampleRateHz = settings.Voice.SampleRateHz,
+ CaptureChunkMs = settings.Voice.CaptureChunkMs,
+ BargeInEnabled = settings.Voice.BargeInEnabled,
+ WakeWord = new VoiceWakeWordSettings
+ {
+ Engine = settings.Voice.WakeWord.Engine,
+ ModelId = settings.Voice.WakeWord.ModelId,
+ TriggerThreshold = settings.Voice.WakeWord.TriggerThreshold,
+ TriggerCooldownMs = settings.Voice.WakeWord.TriggerCooldownMs,
+ PreRollMs = settings.Voice.WakeWord.PreRollMs,
+ EndSilenceMs = settings.Voice.WakeWord.EndSilenceMs
+ },
+ AlwaysOn = new VoiceAlwaysOnSettings
+ {
+ MinSpeechMs = settings.Voice.AlwaysOn.MinSpeechMs,
+ EndSilenceMs = settings.Voice.AlwaysOn.EndSilenceMs,
+ MaxUtteranceMs = settings.Voice.AlwaysOn.MaxUtteranceMs,
+ ChatWindowSubmitMode = GetSelectedChatWindowSubmitMode()
+ }
+ };
+ settings.VoiceProviderCredentials = Clone(_voiceProviderCredentialsDraft);
+ }
+
+ private void LoadVoiceSettings()
+ {
+ if (_settings == null || _voiceService == null)
+ {
+ return;
+ }
+
+ _voiceProviderCredentialsDraft = Clone(_settings.VoiceProviderCredentials);
+ LoadVoiceProviders();
+ SelectVoiceMode(_settings.Voice.Mode);
+ SelectChatWindowSubmitMode(_settings.Voice.AlwaysOn.ChatWindowSubmitMode);
+ VoiceConversationToastsCheckBox.IsChecked = _settings.Voice.ShowConversationToasts;
+ UpdateVoiceProviderSettingsEditor();
+ UpdateVoiceSettingsInfo();
+ }
+
+ private void LoadVoiceProviders()
+ {
+ var catalog = _voiceService!.GetProviderCatalog();
+
+ _speechToTextOptions = catalog.SpeechToTextProviders
+ .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
+ .ToList();
+ _textToSpeechOptions = catalog.TextToSpeechProviders
+ .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
+ .ToList();
+
+ VoiceSpeechToTextProviderComboBox.ItemsSource = _speechToTextOptions;
+ VoiceTextToSpeechProviderComboBox.ItemsSource = _textToSpeechOptions;
+
+ VoiceSpeechToTextProviderComboBox.SelectedItem =
+ _speechToTextOptions.FirstOrDefault(p => p.Id == _settings!.Voice.SpeechToTextProviderId)
+ ?? _speechToTextOptions.FirstOrDefault();
+ VoiceTextToSpeechProviderComboBox.SelectedItem =
+ _textToSpeechOptions.FirstOrDefault(p => p.Id == _settings!.Voice.TextToSpeechProviderId)
+ ?? _textToSpeechOptions.FirstOrDefault();
+ }
+
+ private async Task LoadVoiceDevicesAsync()
+ {
+ if (_settings == null || _voiceService == null)
+ {
+ return;
+ }
+
+ try
+ {
+ VoiceSettingsInfoTextBlock.Text = "Loading voice devices...";
+ var devices = await _voiceService.ListDevicesAsync();
+
+ _inputOptions =
+ [
+ new DeviceOption(null, "System default microphone")
+ ];
+ _inputOptions.AddRange(devices
+ .Where(d => d.IsInput)
+ .Select(d => new DeviceOption(d.DeviceId, d.Name)));
+
+ _outputOptions =
+ [
+ new DeviceOption(null, "System default speaker")
+ ];
+ _outputOptions.AddRange(devices
+ .Where(d => d.IsOutput)
+ .Select(d => new DeviceOption(d.DeviceId, d.Name)));
+
+ VoiceInputDeviceComboBox.ItemsSource = _inputOptions;
+ VoiceOutputDeviceComboBox.ItemsSource = _outputOptions;
+
+ VoiceInputDeviceComboBox.SelectedItem = _inputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.InputDeviceId) ?? _inputOptions[0];
+ VoiceOutputDeviceComboBox.SelectedItem = _outputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.OutputDeviceId) ?? _outputOptions[0];
+
+ UpdateVoiceSettingsInfo();
+ }
+ catch (Exception ex)
+ {
+ VoiceSettingsInfoTextBlock.Text = $"Failed to load voice devices: {ex.Message}";
+ }
+ }
+
+ private void SelectVoiceMode(VoiceActivationMode mode)
+ {
+ var target = mode switch
+ {
+ VoiceActivationMode.WakeWord => "WakeWord",
+ VoiceActivationMode.AlwaysOn => "AlwaysOn",
+ _ => "Off"
+ };
+
+ foreach (var item in VoiceModeComboBox.Items.OfType<ComboBoxItem>())
+ {
+ if (string.Equals(item.Tag?.ToString(), target, StringComparison.Ordinal))
+ {
+ VoiceModeComboBox.SelectedItem = item;
+ return;
+ }
+ }
+
+ VoiceModeComboBox.SelectedIndex = 0;
+ }
+
+ private VoiceActivationMode GetSelectedVoiceMode()
+ {
+ var tag = (VoiceModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
+ return tag switch
+ {
+ "WakeWord" => VoiceActivationMode.WakeWord,
+ "AlwaysOn" => VoiceActivationMode.AlwaysOn,
+ _ => VoiceActivationMode.Off
+ };
+ }
+
+ private void SelectChatWindowSubmitMode(VoiceChatWindowSubmitMode mode)
+ {
+ var target = mode == VoiceChatWindowSubmitMode.WaitForUser ? "WaitForUser" : "AutoSend";
+
+ foreach (var item in VoiceChatWindowSubmitModeComboBox.Items.OfType<ComboBoxItem>())
+ {
+ if (string.Equals(item.Tag?.ToString(), target, StringComparison.Ordinal))
+ {
+ VoiceChatWindowSubmitModeComboBox.SelectedItem = item;
+ return;
+ }
+ }
+
+ VoiceChatWindowSubmitModeComboBox.SelectedIndex = 0;
+ }
+
+ private VoiceChatWindowSubmitMode GetSelectedChatWindowSubmitMode()
+ {
+ var tag = (VoiceChatWindowSubmitModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
+ return tag == "WaitForUser"
+ ? VoiceChatWindowSubmitMode.WaitForUser
+ : VoiceChatWindowSubmitMode.AutoSend;
+ }
+
+ private void UpdateVoiceSettingsInfo()
+ {
+ var stt = (VoiceSpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Recognition";
+ var tts = (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Synthesis";
+ var input = (VoiceInputDeviceComboBox.SelectedItem as DeviceOption)?.Name ?? "System default microphone";
+ var output = (VoiceOutputDeviceComboBox.SelectedItem as DeviceOption)?.Name ?? "System default speaker";
+ var fallbackNotice = string.Empty;
+
+ if (VoiceTextToSpeechProviderComboBox.SelectedItem is ProviderOption ttsOption &&
+ !VoiceProviderCatalogService.SupportsTextToSpeechRuntime(ttsOption.Id))
+ {
+ fallbackNotice = " Unsupported TTS providers will fall back to Windows until their runtime adapters are added.";
+ }
+
+ VoiceSettingsInfoTextBlock.Text =
+ $"Mode: {VoiceDisplayHelper.GetModeLabel(GetSelectedVoiceMode())}. STT: {stt}. TTS: {tts}. Listen: {input}. Talk: {output}.{fallbackNotice}";
+ }
+
+ private void UpdateVoiceProviderSettingsEditor()
+ {
+ var providerId = GetSelectedTextToSpeechProviderId();
+ var showProviderSettings = !string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase);
+
+ VoiceTtsProviderSettingsPanel.Visibility = showProviderSettings ? Visibility.Visible : Visibility.Collapsed;
+ if (!showProviderSettings)
+ {
+ _activeTtsProviderId = VoiceProviderIds.Windows;
+ return;
+ }
+
+ _updatingVoiceProviderFields = true;
+ try
+ {
+ VoiceTtsProviderSettingsTitleTextBlock.Text = $"{GetSelectedTextToSpeechProviderName().ToUpperInvariant()} SETTINGS";
+ VoiceTtsApiKeyPasswordBox.Password = GetProviderApiKey(providerId) ?? string.Empty;
+ VoiceTtsModelTextBox.Text = GetProviderModel(providerId);
+ VoiceTtsVoiceIdTextBox.Text = GetProviderVoiceId(providerId);
+ _activeTtsProviderId = providerId;
+ }
+ finally
+ {
+ _updatingVoiceProviderFields = false;
+ }
+ }
+
+ private string GetSelectedTextToSpeechProviderId()
+ {
+ return (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows;
+ }
+
+ private string GetSelectedTextToSpeechProviderName()
+ {
+ return (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Provider";
+ }
+
+ private void CaptureSelectedVoiceProviderSettings()
+ {
+ if (_updatingVoiceProviderFields)
+ {
+ return;
+ }
+
+ var providerId = _activeTtsProviderId;
+ if (string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase))
+ {
+ return;
+ }
+
+ SetProviderApiKey(providerId, VoiceTtsApiKeyPasswordBox.Password);
+ SetProviderModel(providerId, VoiceTtsModelTextBox.Text);
+ SetProviderVoiceId(providerId, VoiceTtsVoiceIdTextBox.Text);
+ }
+
+ private string? GetProviderApiKey(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxApiKey,
+ VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsApiKey,
+ _ => null
+ };
+ }
+
+ private string GetProviderModel(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxModel,
+ VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsModel ?? string.Empty,
+ _ => string.Empty
+ };
+ }
+
+ private string GetProviderVoiceId(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxVoiceId,
+ VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsVoiceId ?? string.Empty,
+ _ => string.Empty
+ };
+ }
+
+ private void SetProviderApiKey(string providerId, string? value)
+ {
+ var normalized = string.IsNullOrWhiteSpace(value) ? null : value.Trim();
+
+ switch (providerId)
+ {
+ case VoiceProviderIds.MiniMax:
+ _voiceProviderCredentialsDraft.MiniMaxApiKey = normalized;
+ break;
+ case VoiceProviderIds.ElevenLabs:
+ _voiceProviderCredentialsDraft.ElevenLabsApiKey = normalized;
+ break;
+ }
+ }
+
+ private void SetProviderModel(string providerId, string? value)
+ {
+ var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultModel(providerId) : value.Trim();
+
+ switch (providerId)
+ {
+ case VoiceProviderIds.MiniMax:
+ _voiceProviderCredentialsDraft.MiniMaxModel = normalized;
+ break;
+ case VoiceProviderIds.ElevenLabs:
+ _voiceProviderCredentialsDraft.ElevenLabsModel = normalized;
+ break;
+ }
+ }
+
+ private void SetProviderVoiceId(string providerId, string? value)
+ {
+ var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultVoiceId(providerId) : value.Trim();
+
+ switch (providerId)
+ {
+ case VoiceProviderIds.MiniMax:
+ _voiceProviderCredentialsDraft.MiniMaxVoiceId = normalized;
+ break;
+ case VoiceProviderIds.ElevenLabs:
+ _voiceProviderCredentialsDraft.ElevenLabsVoiceId = normalized;
+ break;
+ }
+ }
+
+ private static string GetDefaultModel(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => "speech-2.8-turbo",
+ _ => string.Empty
+ };
+ }
+
+ private static string GetDefaultVoiceId(string providerId)
+ {
+ return providerId switch
+ {
+ VoiceProviderIds.MiniMax => "English_MatureBoss",
+ _ => string.Empty
+ };
+ }
+
+ private async void OnRefreshVoiceDevices(object sender, RoutedEventArgs e)
+ {
+ await LoadVoiceDevicesAsync();
+ }
+
+ private void OnVoiceModeChanged(object sender, SelectionChangedEventArgs e)
+ {
+ UpdateVoiceSettingsInfo();
+ }
+
+ private void OnVoiceProviderChanged(object sender, SelectionChangedEventArgs e)
+ {
+ CaptureSelectedVoiceProviderSettings();
+ UpdateVoiceProviderSettingsEditor();
+ UpdateVoiceSettingsInfo();
+ }
+
+ private void OnVoiceProviderSettingsChanged(object sender, RoutedEventArgs e)
+ {
+ CaptureSelectedVoiceProviderSettings();
+ }
+
+ private static VoiceProviderCredentials Clone(VoiceProviderCredentials source)
+ {
+ return new VoiceProviderCredentials
+ {
+ MiniMaxApiKey = source.MiniMaxApiKey,
+ MiniMaxModel = source.MiniMaxModel,
+ MiniMaxVoiceId = source.MiniMaxVoiceId,
+ ElevenLabsApiKey = source.ElevenLabsApiKey,
+ ElevenLabsModel = source.ElevenLabsModel,
+ ElevenLabsVoiceId = source.ElevenLabsVoiceId
+ };
+ }
+
+ private sealed record DeviceOption(string? DeviceId, string Name);
+ private sealed record ProviderOption(string Id, string Name, string Runtime, string? Description);
+}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml
index 75359d1..a51f028 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml
+++ b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml
@@ -3,6 +3,7 @@
x:Class="OpenClawTray.Windows.SettingsWindow"
xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
+ xmlns:controls="using:OpenClawTray.Controls"
xmlns:winex="using:WinUIEx"
Title="Settings — OpenClaw Tray"
MinWidth="400" MinHeight="500">
@@ -84,58 +85,7 @@
<Button x:Name="TestNotificationButton" x:Uid="SettingsTestNotificationButton" Content="Send Test Notification"
Click="OnTestNotification"/>
- <StackPanel Spacing="8">
- <TextBlock Text="VOICE" Style="{StaticResource CaptionTextBlockStyle}"
- Foreground="#E74C3C" FontWeight="Bold"/>
-
- <ComboBox x:Name="VoiceModeComboBox" Header="Mode" SelectionChanged="OnVoiceModeChanged">
- <ComboBoxItem Content="Off" Tag="Off"/>
- <ComboBoxItem Content="Voice Wake" Tag="WakeWord" IsEnabled="False"/>
- <ComboBoxItem Content="Talk Mode" Tag="AlwaysOn"/>
- </ComboBox>
-
- <ComboBox x:Name="VoiceSpeechToTextProviderComboBox"
- Header="Speech to text provider"
- DisplayMemberPath="Name"
- SelectionChanged="OnVoiceProviderChanged"/>
- <ComboBox x:Name="VoiceTextToSpeechProviderComboBox"
- Header="Text to speech provider"
- DisplayMemberPath="Name"
- SelectionChanged="OnVoiceProviderChanged"/>
-
- <StackPanel x:Name="VoiceTtsProviderSettingsPanel" Spacing="8" Visibility="Collapsed">
- <TextBlock x:Name="VoiceTtsProviderSettingsTitleTextBlock"
- Style="{StaticResource CaptionTextBlockStyle}"
- Foreground="#E74C3C"
- FontWeight="Bold"/>
- <PasswordBox x:Name="VoiceTtsApiKeyPasswordBox"
- Header="API key"
- PasswordChanged="OnVoiceProviderSettingsChanged"/>
- <TextBox x:Name="VoiceTtsModelTextBox"
- Header="Model"
- TextChanged="OnVoiceProviderSettingsChanged"/>
- <TextBox x:Name="VoiceTtsVoiceIdTextBox"
- Header="Voice ID"
- TextChanged="OnVoiceProviderSettingsChanged"/>
- </StackPanel>
-
- <ComboBox x:Name="VoiceInputDeviceComboBox" Header="Listen device (microphone)" DisplayMemberPath="Name"/>
- <ComboBox x:Name="VoiceOutputDeviceComboBox" Header="Talk device (speaker)" DisplayMemberPath="Name"/>
- <Button Content="Refresh voice devices" HorizontalAlignment="Left" Click="OnRefreshVoiceDevices"/>
-
- <CheckBox x:Name="VoiceConversationToastsCheckBox"
- Content="Show voice transcripts and replies as toasts"/>
-
- <ComboBox x:Name="VoiceChatWindowSubmitModeComboBox" Header="When the tray chat window is open">
- <ComboBoxItem Content="Send automatically" Tag="AutoSend"/>
- <ComboBoxItem Content="Fill message box and wait for me to send" Tag="WaitForUser"/>
- </ComboBox>
-
- <TextBlock x:Name="VoiceSettingsInfoTextBlock"
- Style="{StaticResource CaptionTextBlockStyle}"
- Foreground="{ThemeResource TextFillColorSecondaryBrush}"
- TextWrapping="Wrap"/>
- </StackPanel>
+ <controls:VoiceSettingsPanel x:Name="VoiceSettingsPanel"/>
<!-- Advanced Section -->
<StackPanel Spacing="8">
diff --git a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
index b08a262..9e442e9 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
@@ -6,8 +6,6 @@
using OpenClawTray.Services;
using OpenClawTray.Services.Voice;
using System;
-using System.Collections.Generic;
-using System.Linq;
using System.Threading.Tasks;
using WinUIEx;
@@ -16,14 +14,6 @@ namespace OpenClawTray.Windows;
public sealed partial class SettingsWindow : WindowEx
{
private readonly SettingsManager _settings;
- private readonly VoiceService _voiceService;
- private VoiceProviderCredentials _voiceProviderCredentialsDraft = new();
- private string _activeTtsProviderId = VoiceProviderIds.Windows;
- private bool _updatingVoiceProviderFields;
- private List<ProviderOption> _speechToTextOptions = new();
- private List<ProviderOption> _textToSpeechOptions = new();
- private List<DeviceOption> _inputOptions = new();
- private List<DeviceOption> _outputOptions = new();
public bool IsClosed { get; private set; }
@@ -32,7 +22,6 @@ public sealed partial class SettingsWindow : WindowEx
public SettingsWindow(SettingsManager settings, VoiceService voiceService)
{
_settings = settings;
- _voiceService = voiceService;
InitializeComponent();
Title = LocalizationHelper.GetString("WindowTitle_Settings");
@@ -42,7 +31,7 @@ public SettingsWindow(SettingsManager settings, VoiceService voiceService)
this.SetIcon(IconHelper.GetStatusIconPath(ConnectionStatus.Connected));
LoadSettings();
- _ = LoadVoiceDevicesAsync();
+ VoiceSettingsPanel.Initialize(_settings, voiceService);
Closed += (s, e) => IsClosed = true;
@@ -81,78 +70,6 @@ private void LoadSettings()
NotifyInfoCb.IsChecked = _settings.NotifyInfo;
NodeModeToggle.IsOn = _settings.EnableNodeMode;
-
- LoadVoiceSettings();
- }
-
- private void LoadVoiceSettings()
- {
- _voiceProviderCredentialsDraft = Clone(_settings.VoiceProviderCredentials);
- LoadVoiceProviders();
- SelectVoiceMode(_settings.Voice.Mode);
- SelectChatWindowSubmitMode(_settings.Voice.AlwaysOn.ChatWindowSubmitMode);
- VoiceConversationToastsCheckBox.IsChecked = _settings.Voice.ShowConversationToasts;
- UpdateVoiceProviderSettingsEditor();
- UpdateVoiceSettingsInfo();
- }
-
- private void LoadVoiceProviders()
- {
- var catalog = _voiceService.GetProviderCatalog();
-
- _speechToTextOptions = catalog.SpeechToTextProviders
- .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
- .ToList();
- _textToSpeechOptions = catalog.TextToSpeechProviders
- .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
- .ToList();
-
- VoiceSpeechToTextProviderComboBox.ItemsSource = _speechToTextOptions;
- VoiceTextToSpeechProviderComboBox.ItemsSource = _textToSpeechOptions;
-
- VoiceSpeechToTextProviderComboBox.SelectedItem =
- _speechToTextOptions.FirstOrDefault(p => p.Id == _settings.Voice.SpeechToTextProviderId)
- ?? _speechToTextOptions.FirstOrDefault();
- VoiceTextToSpeechProviderComboBox.SelectedItem =
- _textToSpeechOptions.FirstOrDefault(p => p.Id == _settings.Voice.TextToSpeechProviderId)
- ?? _textToSpeechOptions.FirstOrDefault();
- }
-
- private async Task LoadVoiceDevicesAsync()
- {
- try
- {
- VoiceSettingsInfoTextBlock.Text = "Loading voice devices...";
- var devices = await _voiceService.ListDevicesAsync();
-
- _inputOptions =
- [
- new DeviceOption(null, "System default microphone")
- ];
- _inputOptions.AddRange(devices
- .Where(d => d.IsInput)
- .Select(d => new DeviceOption(d.DeviceId, d.Name)));
-
- _outputOptions =
- [
- new DeviceOption(null, "System default speaker")
- ];
- _outputOptions.AddRange(devices
- .Where(d => d.IsOutput)
- .Select(d => new DeviceOption(d.DeviceId, d.Name)));
-
- VoiceInputDeviceComboBox.ItemsSource = _inputOptions;
- VoiceOutputDeviceComboBox.ItemsSource = _outputOptions;
-
- VoiceInputDeviceComboBox.SelectedItem = _inputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.InputDeviceId) ?? _inputOptions[0];
- VoiceOutputDeviceComboBox.SelectedItem = _outputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.OutputDeviceId) ?? _outputOptions[0];
-
- UpdateVoiceSettingsInfo();
- }
- catch (Exception ex)
- {
- VoiceSettingsInfoTextBlock.Text = $"Failed to load voice devices: {ex.Message}";
- }
}
private void SaveSettings()
@@ -178,287 +95,12 @@ private void SaveSettings()
_settings.NotifyInfo = NotifyInfoCb.IsChecked ?? true;
_settings.EnableNodeMode = NodeModeToggle.IsOn;
- CaptureSelectedVoiceProviderSettings();
-
- _settings.Voice = new VoiceSettings
- {
- Mode = GetSelectedVoiceMode(),
- Enabled = GetSelectedVoiceMode() != VoiceActivationMode.Off,
- ShowConversationToasts = VoiceConversationToastsCheckBox.IsChecked ?? false,
- SpeechToTextProviderId = (VoiceSpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
- TextToSpeechProviderId = (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
- InputDeviceId = (VoiceInputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
- OutputDeviceId = (VoiceOutputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
- SampleRateHz = _settings.Voice.SampleRateHz,
- CaptureChunkMs = _settings.Voice.CaptureChunkMs,
- BargeInEnabled = _settings.Voice.BargeInEnabled,
- WakeWord = new VoiceWakeWordSettings
- {
- Engine = _settings.Voice.WakeWord.Engine,
- ModelId = _settings.Voice.WakeWord.ModelId,
- TriggerThreshold = _settings.Voice.WakeWord.TriggerThreshold,
- TriggerCooldownMs = _settings.Voice.WakeWord.TriggerCooldownMs,
- PreRollMs = _settings.Voice.WakeWord.PreRollMs,
- EndSilenceMs = _settings.Voice.WakeWord.EndSilenceMs
- },
- AlwaysOn = new VoiceAlwaysOnSettings
- {
- MinSpeechMs = _settings.Voice.AlwaysOn.MinSpeechMs,
- EndSilenceMs = _settings.Voice.AlwaysOn.EndSilenceMs,
- MaxUtteranceMs = _settings.Voice.AlwaysOn.MaxUtteranceMs,
- ChatWindowSubmitMode = GetSelectedChatWindowSubmitMode()
- }
- };
- _settings.VoiceProviderCredentials = Clone(_voiceProviderCredentialsDraft);
+ VoiceSettingsPanel.ApplyTo(_settings);
_settings.Save();
AutoStartManager.SetAutoStart(_settings.AutoStart);
}
- private void SelectVoiceMode(VoiceActivationMode mode)
- {
- var target = mode switch
- {
- VoiceActivationMode.WakeWord => "WakeWord",
- VoiceActivationMode.AlwaysOn => "AlwaysOn",
- _ => "Off"
- };
-
- foreach (var item in VoiceModeComboBox.Items.OfType<ComboBoxItem>())
- {
- if (string.Equals(item.Tag?.ToString(), target, StringComparison.Ordinal))
- {
- VoiceModeComboBox.SelectedItem = item;
- return;
- }
- }
-
- VoiceModeComboBox.SelectedIndex = 0;
- }
-
- private VoiceActivationMode GetSelectedVoiceMode()
- {
- var tag = (VoiceModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
- return tag switch
- {
- "WakeWord" => VoiceActivationMode.WakeWord,
- "AlwaysOn" => VoiceActivationMode.AlwaysOn,
- _ => VoiceActivationMode.Off
- };
- }
-
- private void SelectChatWindowSubmitMode(VoiceChatWindowSubmitMode mode)
- {
- var target = mode == VoiceChatWindowSubmitMode.WaitForUser ? "WaitForUser" : "AutoSend";
-
- foreach (var item in VoiceChatWindowSubmitModeComboBox.Items.OfType<ComboBoxItem>())
- {
- if (string.Equals(item.Tag?.ToString(), target, StringComparison.Ordinal))
- {
- VoiceChatWindowSubmitModeComboBox.SelectedItem = item;
- return;
- }
- }
-
- VoiceChatWindowSubmitModeComboBox.SelectedIndex = 0;
- }
-
- private VoiceChatWindowSubmitMode GetSelectedChatWindowSubmitMode()
- {
- var tag = (VoiceChatWindowSubmitModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
- return tag == "WaitForUser"
- ? VoiceChatWindowSubmitMode.WaitForUser
- : VoiceChatWindowSubmitMode.AutoSend;
- }
-
- private void UpdateVoiceSettingsInfo()
- {
- var stt = (VoiceSpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Recognition";
- var tts = (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Synthesis";
- var input = (VoiceInputDeviceComboBox.SelectedItem as DeviceOption)?.Name ?? "System default microphone";
- var output = (VoiceOutputDeviceComboBox.SelectedItem as DeviceOption)?.Name ?? "System default speaker";
- var fallbackNotice = string.Empty;
-
- if (VoiceTextToSpeechProviderComboBox.SelectedItem is ProviderOption ttsOption &&
- !VoiceProviderCatalogService.SupportsTextToSpeechRuntime(ttsOption.Id))
- {
- fallbackNotice = " Unsupported TTS providers will fall back to Windows until their runtime adapters are added.";
- }
-
- VoiceSettingsInfoTextBlock.Text =
- $"Mode: {VoiceDisplayHelper.GetModeLabel(GetSelectedVoiceMode())}. STT: {stt}. TTS: {tts}. Listen: {input}. Talk: {output}.{fallbackNotice}";
- }
-
- private void UpdateVoiceProviderSettingsEditor()
- {
- var providerId = GetSelectedTextToSpeechProviderId();
- var showProviderSettings = !string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase);
-
- VoiceTtsProviderSettingsPanel.Visibility = showProviderSettings ? Visibility.Visible : Visibility.Collapsed;
- if (!showProviderSettings)
- {
- _activeTtsProviderId = VoiceProviderIds.Windows;
- return;
- }
-
- _updatingVoiceProviderFields = true;
- try
- {
- VoiceTtsProviderSettingsTitleTextBlock.Text = $"{GetSelectedTextToSpeechProviderName().ToUpperInvariant()} SETTINGS";
- VoiceTtsApiKeyPasswordBox.Password = GetProviderApiKey(providerId) ?? string.Empty;
- VoiceTtsModelTextBox.Text = GetProviderModel(providerId);
- VoiceTtsVoiceIdTextBox.Text = GetProviderVoiceId(providerId);
- _activeTtsProviderId = providerId;
- }
- finally
- {
- _updatingVoiceProviderFields = false;
- }
- }
-
- private string GetSelectedTextToSpeechProviderId()
- {
- return (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows;
- }
-
- private string GetSelectedTextToSpeechProviderName()
- {
- return (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Provider";
- }
-
- private void CaptureSelectedVoiceProviderSettings()
- {
- if (_updatingVoiceProviderFields)
- {
- return;
- }
-
- var providerId = _activeTtsProviderId;
- if (string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase))
- {
- return;
- }
-
- SetProviderApiKey(providerId, VoiceTtsApiKeyPasswordBox.Password);
- SetProviderModel(providerId, VoiceTtsModelTextBox.Text);
- SetProviderVoiceId(providerId, VoiceTtsVoiceIdTextBox.Text);
- }
-
- private string? GetProviderApiKey(string providerId)
- {
- return providerId switch
- {
- VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxApiKey,
- VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsApiKey,
- _ => null
- };
- }
-
- private string GetProviderModel(string providerId)
- {
- return providerId switch
- {
- VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxModel,
- VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsModel ?? string.Empty,
- _ => string.Empty
- };
- }
-
- private string GetProviderVoiceId(string providerId)
- {
- return providerId switch
- {
- VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxVoiceId,
- VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsVoiceId ?? string.Empty,
- _ => string.Empty
- };
- }
-
- private void SetProviderApiKey(string providerId, string? value)
- {
- var normalized = string.IsNullOrWhiteSpace(value) ? null : value.Trim();
-
- switch (providerId)
- {
- case VoiceProviderIds.MiniMax:
- _voiceProviderCredentialsDraft.MiniMaxApiKey = normalized;
- break;
- case VoiceProviderIds.ElevenLabs:
- _voiceProviderCredentialsDraft.ElevenLabsApiKey = normalized;
- break;
- }
- }
-
- private void SetProviderModel(string providerId, string? value)
- {
- var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultModel(providerId) : value.Trim();
-
- switch (providerId)
- {
- case VoiceProviderIds.MiniMax:
- _voiceProviderCredentialsDraft.MiniMaxModel = normalized;
- break;
- case VoiceProviderIds.ElevenLabs:
- _voiceProviderCredentialsDraft.ElevenLabsModel = normalized;
- break;
- }
- }
-
- private void SetProviderVoiceId(string providerId, string? value)
- {
- var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultVoiceId(providerId) : value.Trim();
-
- switch (providerId)
- {
- case VoiceProviderIds.MiniMax:
- _voiceProviderCredentialsDraft.MiniMaxVoiceId = normalized;
- break;
- case VoiceProviderIds.ElevenLabs:
- _voiceProviderCredentialsDraft.ElevenLabsVoiceId = normalized;
- break;
- }
- }
-
- private static string GetDefaultModel(string providerId)
- {
- return providerId switch
- {
- VoiceProviderIds.MiniMax => "speech-2.8-turbo",
- _ => string.Empty
- };
- }
-
- private static string GetDefaultVoiceId(string providerId)
- {
- return providerId switch
- {
- VoiceProviderIds.MiniMax => "English_MatureBoss",
- _ => string.Empty
- };
- }
-
- private async void OnRefreshVoiceDevices(object sender, RoutedEventArgs e)
- {
- await LoadVoiceDevicesAsync();
- }
-
- private void OnVoiceModeChanged(object sender, SelectionChangedEventArgs e)
- {
- UpdateVoiceSettingsInfo();
- }
-
- private void OnVoiceProviderChanged(object sender, SelectionChangedEventArgs e)
- {
- CaptureSelectedVoiceProviderSettings();
- UpdateVoiceProviderSettingsEditor();
- UpdateVoiceSettingsInfo();
- }
-
- private void OnVoiceProviderSettingsChanged(object sender, RoutedEventArgs e)
- {
- CaptureSelectedVoiceProviderSettings();
- }
-
private async void OnTestConnection(object sender, RoutedEventArgs e)
{
var gatewayUrl = GatewayUrlTextBox.Text.Trim();
@@ -584,22 +226,6 @@ private void OnCancel(object sender, RoutedEventArgs e)
Close();
}
- private static VoiceProviderCredentials Clone(VoiceProviderCredentials source)
- {
- return new VoiceProviderCredentials
- {
- MiniMaxApiKey = source.MiniMaxApiKey,
- MiniMaxModel = source.MiniMaxModel,
- MiniMaxVoiceId = source.MiniMaxVoiceId,
- ElevenLabsApiKey = source.ElevenLabsApiKey,
- ElevenLabsModel = source.ElevenLabsModel,
- ElevenLabsVoiceId = source.ElevenLabsVoiceId
- };
- }
-
- private sealed record DeviceOption(string? DeviceId, string Name);
- private sealed record ProviderOption(string Id, string Name, string Runtime, string? Description);
-
private class TestLogger : IOpenClawLogger
{
public string? LastError { get; private set; }
From ded41a2cfe39ef43e8522f52d5768f96413085a3 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 13:26:56 +0000
Subject: [PATCH 17/83] Generalize cloud TTS providers through catalog
contracts
---
docs/VOICE-MODE.md | 110 +++++--
src/OpenClaw.Shared/SettingsData.cs | 5 +-
src/OpenClaw.Shared/VoiceModeSchema.cs | 54 ++++
...iceProviderConfigurationStoreExtensions.cs | 161 ++++++++++
.../Controls/VoiceSettingsPanel.xaml.cs | 226 +++++++------
.../Services/SettingsManager.cs | 7 +-
.../Voice/VoiceCloudTextToSpeechClient.cs | 296 ++++++++++++++++++
.../Voice/VoiceProviderCatalogService.cs | 150 ++++++++-
.../Services/Voice/VoiceService.cs | 146 +--------
.../VoiceModeSchemaTests.cs | 33 +-
.../SettingsRoundTripTests.cs | 77 +++--
.../VoiceProviderCatalogServiceTests.cs | 26 +-
12 files changed, 959 insertions(+), 332 deletions(-)
create mode 100644 src/OpenClaw.Shared/VoiceProviderConfigurationStoreExtensions.cs
create mode 100644 src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
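
Illustrative note, not part of this patch: the catalog templates added to docs/VOICE-MODE.md write placeholders such as `{{model}}` and `{{text}}` unquoted inside the JSON request body. That suggests each value is substituted as a JSON-encoded string literal, while `{{voiceId}}` in the ElevenLabs endpoint would need URL escaping. The sketch below shows one way that substitution could work under those assumptions. `TemplateSketch`, `RenderJsonBody`, and `RenderEndpoint` are hypothetical names; the real logic lives in VoiceCloudTextToSpeechClient.cs and may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not the PR's VoiceCloudTextToSpeechClient.
public static class TemplateSketch
{
    // Matches {{key}} placeholders such as {{model}} or {{voiceId}}.
    private static readonly Regex Placeholder = new(@"\{\{(\w+)\}\}");

    // Missing keys render as an empty string (an assumption made for this sketch).
    private static string Lookup(IReadOnlyDictionary<string, string> values, string key) =>
        values.TryGetValue(key, out var value) ? value : string.Empty;

    // Replaces each placeholder with a JSON string literal (quotes included), so a
    // template can write "model": {{model}} without quoting the placeholder itself.
    public static string RenderJsonBody(string template, IReadOnlyDictionary<string, string> values) =>
        Placeholder.Replace(template, m => JsonSerializer.Serialize(Lookup(values, m.Groups[1].Value)));

    // Replaces each placeholder with a URL-escaped value for use in an endpoint path.
    public static string RenderEndpoint(string template, IReadOnlyDictionary<string, string> values) =>
        Placeholder.Replace(template, m => Uri.EscapeDataString(Lookup(values, m.Groups[1].Value)));
}

public static class Program
{
    public static void Main()
    {
        // Case-insensitive keys mirror how the store extensions compare setting keys.
        var values = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["model"] = "eleven_multilingual_v2",
            ["voiceId"] = "voice id/1",
            ["text"] = "Hello from the tray app"
        };

        Console.WriteLine(TemplateSketch.RenderEndpoint(
            "https://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}", values));
        Console.WriteLine(TemplateSketch.RenderJsonBody(
            "{ \"text\": {{text}}, \"model_id\": {{model}} }", values));
    }
}
```

Encoding at substitution time keeps user-supplied text from breaking the JSON body or the URL path, which is presumably why the catalog templates leave the placeholders unquoted.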
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 18a9f0b..72f4526 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -5,6 +5,7 @@ This document defines the voice subsystem for the Windows node only. It introduc
## Goals
- Add a node-local voice mode with two activation modes: `wakeword` and `alwaysOn`
+- Keep touch points with the existing app minimal to reduce the risk of regressions.
- Use NanoWakeWord for wakeword detection on-device
- Present the user-facing mode names as `Voice Wake` and `Talk Mode`
- Keep STT/TTS provider selection configurable, with Windows implementations as the default built-ins
@@ -131,8 +132,9 @@ The built-in default for both is `windows`.
Runtime behavior in the current phase:
- `windows` is implemented for both STT and TTS
-- `minimax` TTS is implemented with `speech-2.8-turbo` and `English_MatureBoss`
-- `elevenlabs` TTS remains required next-phase work, not an optional future nice-to-have
+- built-in catalog entries exist for both `minimax` and `elevenlabs` TTS
+- `minimax` defaults to `speech-2.8-turbo` and `English_MatureBoss`
+- `elevenlabs` defaults to `eleven_multilingual_v2` and requires a user-supplied voice id
- non-Windows providers can be selected and persisted now
- unsupported providers fall back to Windows at runtime with a status warning
@@ -168,40 +170,74 @@ Example:
"name": "MiniMax Speech 2.8 Turbo",
"runtime": "cloud",
"enabled": true,
- "description": "speech-2.8-turbo using English_MatureBoss."
+ "description": "Cloud TTS using MiniMax HTTP text-to-speech.",
+ "settings": [
+ { "key": "apiKey", "label": "API key", "secret": true },
+ { "key": "model", "label": "Model", "defaultValue": "speech-2.8-turbo" },
+ { "key": "voiceId", "label": "Voice ID", "defaultValue": "English_MatureBoss" }
+ ],
+ "textToSpeechHttp": {
+ "endpointTemplate": "https://api.minimax.io/v1/t2a_v2",
+ "httpMethod": "POST",
+ "authenticationHeaderName": "Authorization",
+ "authenticationScheme": "Bearer",
+ "apiKeySettingKey": "apiKey",
+ "requestContentType": "application/json",
+ "requestBodyTemplate": "{ \"model\": {{model}}, \"text\": {{text}}, \"stream\": false, \"language_boost\": \"English\", \"output_format\": \"hex\", \"voice_setting\": { \"voice_id\": {{voiceId}}, \"speed\": 1, \"vol\": 1, \"pitch\": 0 }, \"audio_setting\": { \"sample_rate\": 32000, \"bitrate\": 128000, \"format\": \"mp3\", \"channel\": 1 } }",
+ "responseAudioMode": "hexJsonString",
+ "responseAudioJsonPath": "data.audio",
+ "responseStatusCodeJsonPath": "base_resp.status_code",
+ "responseStatusMessageJsonPath": "base_resp.status_msg",
+ "successStatusValue": "0",
+ "outputContentType": "audio/mpeg"
+ }
},
{
"id": "elevenlabs",
"name": "ElevenLabs",
- "runtime": "gateway",
+ "runtime": "cloud",
"enabled": true,
- "description": "Required next-phase provider."
+ "description": "Cloud TTS using the ElevenLabs create speech API.",
+ "settings": [
+ { "key": "apiKey", "label": "API key", "secret": true },
+ { "key": "model", "label": "Model", "defaultValue": "eleven_multilingual_v2" },
+ { "key": "voiceId", "label": "Voice ID", "placeholder": "Enter an ElevenLabs voice ID" }
+ ],
+ "textToSpeechHttp": {
+ "endpointTemplate": "https://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}?output_format=mp3_44100_128",
+ "httpMethod": "POST",
+ "authenticationHeaderName": "xi-api-key",
+ "apiKeySettingKey": "apiKey",
+ "requestContentType": "application/json",
+ "requestBodyTemplate": "{ \"text\": {{text}}, \"model_id\": {{model}} }",
+ "responseAudioMode": "binary",
+ "outputContentType": "audio/mpeg"
+ }
}
]
}
```
-This file only defines selectable providers. It does not carry API keys.
+For HTTP-backed TTS providers, the catalog carries the request/response contract. A new provider can therefore be added without recompiling, as long as its API fits the same contract shape: a templated endpoint and request body, an API key sent in a header, and audio returned as binary, hex, or base64.
+
+This file defines provider metadata and HTTP contracts. It does not carry API keys.
-### Local Credentials
+### Local Provider Configuration
That means the current design is:
- local tray settings choose the preferred STT/TTS provider ids
-- provider API keys are stored in `%APPDATA%\\OpenClawTray\\settings.json` under `VoiceProviderCredentials`
+- provider API keys and editable values are stored in `%APPDATA%\\OpenClawTray\\settings.json` under `VoiceProviderConfiguration`
- OpenClaw remains the conversation endpoint for `chat.send`
- the local provider catalog remains metadata-only and must not contain secrets
This is an intentional short-term design choice so the Windows tray app can use cloud TTS providers without inventing a second catalog file for secrets. It can be revisited later if provider ownership is split differently.
-Current credential fields:
+Current configuration values are keyed by provider id. The built-in providers use:
-- `VoiceProviderCredentials.MiniMaxApiKey`
-- `VoiceProviderCredentials.MiniMaxModel`
-- `VoiceProviderCredentials.MiniMaxVoiceId`
-- `VoiceProviderCredentials.ElevenLabsApiKey`
-- `VoiceProviderCredentials.ElevenLabsModel`
-- `VoiceProviderCredentials.ElevenLabsVoiceId`
+- `apiKey`
+- `model`
+- `voiceId`
When the selected TTS provider in the Voice Mode window is not `windows`, the tray app shows provider-specific fields in the configuration form so the user can enter or edit:
@@ -242,7 +278,7 @@ These contracts are defined in [VoiceModeSchema.cs](../src/OpenClaw.Shared/Voice
## Settings Schema
Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](../src/OpenClaw.Shared/SettingsData.cs).
-Provider credentials are persisted as `SettingsData.VoiceProviderCredentials` in the same local settings file.
+Provider configuration is persisted as `SettingsData.VoiceProviderConfiguration` in the same local settings file.
The editable voice configuration now lives in the main Settings window.
The tray `Voice Mode` window is a read-only runtime status/detail surface with a shortcut back into Settings.
@@ -276,13 +312,25 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
"ChatWindowSubmitMode": "AutoSend"
}
},
- "VoiceProviderCredentials": {
- "MiniMaxApiKey": "<local secret>",
- "MiniMaxModel": "speech-2.8-turbo",
- "MiniMaxVoiceId": "English_MatureBoss",
- "ElevenLabsApiKey": null,
- "ElevenLabsModel": null,
- "ElevenLabsVoiceId": null
+ "VoiceProviderConfiguration": {
+ "Providers": [
+ {
+ "ProviderId": "minimax",
+ "Values": {
+ "apiKey": "<local secret>",
+ "model": "speech-2.8-turbo",
+ "voiceId": "English_MatureBoss"
+ }
+ },
+ {
+ "ProviderId": "elevenlabs",
+ "Values": {
+ "apiKey": "<local secret>",
+ "model": "eleven_multilingual_v2",
+ "voiceId": "voice-id"
+ }
+ }
+ ]
}
}
```
@@ -325,12 +373,10 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
| `Voice.AlwaysOn.EndSilenceMs` | int | `900` | always-on | Silence timeout used to finalize an utterance |
| `Voice.AlwaysOn.MaxUtteranceMs` | int | `15000` | always-on | Hard cap on utterance length before forced submission/finalization |
| `Voice.AlwaysOn.ChatWindowSubmitMode` | enum | `AutoSend` | always-on | When the tray chat window is open, either auto-send the finalized utterance or leave it in the compose box for manual send |
-| `VoiceProviderCredentials.MiniMaxApiKey` | string? | `null` | minimax tts | API key used for MiniMax cloud TTS requests |
-| `VoiceProviderCredentials.MiniMaxModel` | string | `speech-2.8-turbo` | minimax tts | MiniMax TTS model identifier editable in the Voice Mode form |
-| `VoiceProviderCredentials.MiniMaxVoiceId` | string | `English_MatureBoss` | minimax tts | MiniMax TTS voice id editable in the Voice Mode form |
-| `VoiceProviderCredentials.ElevenLabsApiKey` | string? | `null` | elevenlabs tts | Reserved for the required ElevenLabs TTS implementation |
-| `VoiceProviderCredentials.ElevenLabsModel` | string? | `null` | elevenlabs tts | Reserved for future ElevenLabs model selection in the Voice Mode form |
-| `VoiceProviderCredentials.ElevenLabsVoiceId` | string? | `null` | elevenlabs tts | Reserved for future ElevenLabs voice selection in the Voice Mode form |
+| `VoiceProviderConfiguration.Providers[].ProviderId` | string | none | cloud providers | Provider id matching a `voice-providers.json` entry |
+| `VoiceProviderConfiguration.Providers[].Values["apiKey"]` | string? | `null` | cloud providers | API key sent using the provider contract's configured auth header |
+| `VoiceProviderConfiguration.Providers[].Values["model"]` | string? | provider default | cloud providers | Model identifier inserted into the configured request template |
+| `VoiceProviderConfiguration.Providers[].Values["voiceId"]` | string? | provider default | cloud providers | Voice id inserted into the configured request template or URL |
At runtime today, those device ids are persisted and surfaced in the UI, but the v1 `AlwaysOn` path still uses the Windows system speech stack defaults for capture and playback.
@@ -460,14 +506,14 @@ sequenceDiagram
Provider support is now part of the Windows voice subsystem roadmap, not a hypothetical extension:
-- `MiniMax` TTS is implemented first in the tray app
-- `ElevenLabs` TTS remains required follow-up work
+- `MiniMax` and `ElevenLabs` TTS are both expressed through built-in catalog contracts
+- additional HTTP TTS providers can be added through the local catalog without recompiling the tray app
- Windows STT remains the active speech-recognition baseline until a non-Windows STT provider is deliberately added
The Windows node still keeps provider choice bounded:
- local tray settings choose the provider ids
-- local tray settings store the provider secrets for now
+- local tray settings store the provider secrets and editable values for now
- OpenClaw still owns the conversation/session flow
This keeps the provider surface narrow while still meeting the required MiniMax/ElevenLabs support direction.
diff --git a/src/OpenClaw.Shared/SettingsData.cs b/src/OpenClaw.Shared/SettingsData.cs
index e421648..d2a2c6d 100644
--- a/src/OpenClaw.Shared/SettingsData.cs
+++ b/src/OpenClaw.Shared/SettingsData.cs
@@ -1,3 +1,4 @@
+using System.Text.Json.Serialization;
using System.Text.Json;
namespace OpenClaw.Shared;
@@ -27,7 +28,9 @@ public class SettingsData
public bool PreferStructuredCategories { get; set; } = true;
public List<UserNotificationRule>? UserRules { get; set; }
public VoiceSettings Voice { get; set; } = new();
- public VoiceProviderCredentials VoiceProviderCredentials { get; set; } = new();
+ public VoiceProviderConfigurationStore VoiceProviderConfiguration { get; set; } = new();
+ [JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
+ public VoiceProviderCredentials? VoiceProviderCredentials { get; set; }
private static readonly JsonSerializerOptions s_options = new() { WriteIndented = true };
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index bc1aadb..2f99bec 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -140,6 +140,20 @@ public static class VoiceProviderIds
public const string ElevenLabs = "elevenlabs";
}
+public static class VoiceProviderSettingKeys
+{
+ public const string ApiKey = "apiKey";
+ public const string Model = "model";
+ public const string VoiceId = "voiceId";
+}
+
+public static class VoiceTextToSpeechResponseModes
+{
+ public const string Binary = "binary";
+ public const string HexJsonString = "hexJsonString";
+ public const string Base64JsonString = "base64JsonString";
+}
+
public sealed class VoiceProviderCredentials
{
public string? MiniMaxApiKey { get; set; }
@@ -150,6 +164,44 @@ public sealed class VoiceProviderCredentials
public string? ElevenLabsVoiceId { get; set; }
}
+public sealed class VoiceProviderConfigurationStore
+{
+ public List<VoiceProviderConfiguration> Providers { get; set; } = [];
+}
+
+public sealed class VoiceProviderConfiguration
+{
+ public string ProviderId { get; set; } = "";
+ public Dictionary<string, string> Values { get; set; } = [];
+}
+
+public sealed class VoiceProviderSettingDefinition
+{
+ public string Key { get; set; } = "";
+ public string Label { get; set; } = "";
+ public bool Secret { get; set; }
+ public string? DefaultValue { get; set; }
+ public string? Placeholder { get; set; }
+ public string? Description { get; set; }
+}
+
+public sealed class VoiceTextToSpeechHttpContract
+{
+ public string EndpointTemplate { get; set; } = "";
+ public string HttpMethod { get; set; } = "POST";
+ public string AuthenticationHeaderName { get; set; } = "Authorization";
+ public string? AuthenticationScheme { get; set; } = "Bearer";
+ public string ApiKeySettingKey { get; set; } = VoiceProviderSettingKeys.ApiKey;
+ public string RequestContentType { get; set; } = "application/json";
+ public string RequestBodyTemplate { get; set; } = "";
+ public string ResponseAudioMode { get; set; } = VoiceTextToSpeechResponseModes.Binary;
+ public string? ResponseAudioJsonPath { get; set; }
+ public string? ResponseStatusCodeJsonPath { get; set; }
+ public string? ResponseStatusMessageJsonPath { get; set; }
+ public string? SuccessStatusValue { get; set; }
+ public string OutputContentType { get; set; } = "audio/mpeg";
+}
+
public sealed class VoiceProviderOption
{
public string Id { get; set; } = "";
@@ -157,6 +209,8 @@ public sealed class VoiceProviderOption
public string Runtime { get; set; } = "windows";
public bool Enabled { get; set; } = true;
public string? Description { get; set; }
+ public List<VoiceProviderSettingDefinition> Settings { get; set; } = [];
+ public VoiceTextToSpeechHttpContract? TextToSpeechHttp { get; set; }
}
public sealed class VoiceProviderCatalog
diff --git a/src/OpenClaw.Shared/VoiceProviderConfigurationStoreExtensions.cs b/src/OpenClaw.Shared/VoiceProviderConfigurationStoreExtensions.cs
new file mode 100644
index 0000000..b1dfa41
--- /dev/null
+++ b/src/OpenClaw.Shared/VoiceProviderConfigurationStoreExtensions.cs
@@ -0,0 +1,161 @@
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+namespace OpenClaw.Shared;
+
+public static class VoiceProviderConfigurationStoreExtensions
+{
+ public static VoiceProviderConfiguration GetOrAddProvider(
+ this VoiceProviderConfigurationStore store,
+ string providerId)
+ {
+ ArgumentNullException.ThrowIfNull(store);
+
+ var existing = store.Providers.FirstOrDefault(p =>
+ string.Equals(p.ProviderId, providerId, StringComparison.OrdinalIgnoreCase));
+ if (existing != null)
+ {
+ return existing;
+ }
+
+ var created = new VoiceProviderConfiguration
+ {
+ ProviderId = providerId
+ };
+ store.Providers.Add(created);
+ return created;
+ }
+
+ public static VoiceProviderConfiguration? FindProvider(
+ this VoiceProviderConfigurationStore store,
+ string? providerId)
+ {
+ ArgumentNullException.ThrowIfNull(store);
+
+ if (string.IsNullOrWhiteSpace(providerId))
+ {
+ return null;
+ }
+
+ return store.Providers.FirstOrDefault(p =>
+ string.Equals(p.ProviderId, providerId, StringComparison.OrdinalIgnoreCase));
+ }
+
+ public static string? GetValue(
+ this VoiceProviderConfigurationStore store,
+ string? providerId,
+ string settingKey)
+ {
+ return store.FindProvider(providerId)?.GetValue(settingKey);
+ }
+
+ public static string? GetValue(this VoiceProviderConfiguration configuration, string settingKey)
+ {
+ ArgumentNullException.ThrowIfNull(configuration);
+
+ if (string.IsNullOrWhiteSpace(settingKey))
+ {
+ return null;
+ }
+
+ return configuration.Values.FirstOrDefault(entry =>
+ string.Equals(entry.Key, settingKey, StringComparison.OrdinalIgnoreCase)).Value;
+ }
+
+ public static void SetValue(
+ this VoiceProviderConfigurationStore store,
+ string providerId,
+ string settingKey,
+ string? value)
+ {
+ ArgumentNullException.ThrowIfNull(store);
+
+ var provider = store.GetOrAddProvider(providerId);
+ provider.SetValue(settingKey, value);
+ }
+
+ public static void SetValue(
+ this VoiceProviderConfiguration configuration,
+ string settingKey,
+ string? value)
+ {
+ ArgumentNullException.ThrowIfNull(configuration);
+
+ if (string.IsNullOrWhiteSpace(settingKey))
+ {
+ return;
+ }
+
+ var existingKey = configuration.Values.Keys.FirstOrDefault(key =>
+ string.Equals(key, settingKey, StringComparison.OrdinalIgnoreCase));
+
+ if (string.IsNullOrWhiteSpace(value))
+ {
+ if (existingKey != null)
+ {
+ configuration.Values.Remove(existingKey);
+ }
+
+ return;
+ }
+
+ if (existingKey != null)
+ {
+ configuration.Values[existingKey] = value.Trim();
+ return;
+ }
+
+ configuration.Values[settingKey] = value.Trim();
+ }
+
+ public static void MigrateLegacyCredentials(
+ this VoiceProviderConfigurationStore store,
+ VoiceProviderCredentials? legacy)
+ {
+ ArgumentNullException.ThrowIfNull(store);
+
+ if (legacy == null)
+ {
+ return;
+ }
+
+ var hasMiniMaxValues =
+ !string.IsNullOrWhiteSpace(legacy.MiniMaxApiKey) ||
+ !string.IsNullOrWhiteSpace(legacy.MiniMaxModel) ||
+ !string.IsNullOrWhiteSpace(legacy.MiniMaxVoiceId);
+ if (hasMiniMaxValues)
+ {
+ store.SetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.ApiKey, legacy.MiniMaxApiKey);
+ store.SetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.Model, legacy.MiniMaxModel);
+ store.SetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.VoiceId, legacy.MiniMaxVoiceId);
+ }
+
+ var hasElevenLabsValues =
+ !string.IsNullOrWhiteSpace(legacy.ElevenLabsApiKey) ||
+ !string.IsNullOrWhiteSpace(legacy.ElevenLabsModel) ||
+ !string.IsNullOrWhiteSpace(legacy.ElevenLabsVoiceId);
+ if (hasElevenLabsValues)
+ {
+ store.SetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.ApiKey, legacy.ElevenLabsApiKey);
+ store.SetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.Model, legacy.ElevenLabsModel);
+ store.SetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.VoiceId, legacy.ElevenLabsVoiceId);
+ }
+ }
+
+ public static VoiceProviderConfigurationStore Clone(this VoiceProviderConfigurationStore source)
+ {
+ ArgumentNullException.ThrowIfNull(source);
+
+ return new VoiceProviderConfigurationStore
+ {
+ Providers = source.Providers
+ .Select(provider => new VoiceProviderConfiguration
+ {
+ ProviderId = provider.ProviderId,
+ Values = new Dictionary<string, string>(provider.Values, StringComparer.OrdinalIgnoreCase)
+ })
+ .ToList()
+ };
+ }
+}
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
index 5c658e8..968b00d 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -14,11 +14,11 @@ public sealed partial class VoiceSettingsPanel : UserControl
{
private SettingsManager? _settings;
private VoiceService? _voiceService;
- private VoiceProviderCredentials _voiceProviderCredentialsDraft = new();
+ private VoiceProviderConfigurationStore _voiceProviderConfigurationDraft = new();
private string _activeTtsProviderId = VoiceProviderIds.Windows;
private bool _updatingVoiceProviderFields;
- private List<ProviderOption> _speechToTextOptions = new();
- private List<ProviderOption> _textToSpeechOptions = new();
+ private List<VoiceProviderOption> _speechToTextOptions = new();
+ private List<VoiceProviderOption> _textToSpeechOptions = new();
private List<DeviceOption> _inputOptions = new();
private List<DeviceOption> _outputOptions = new();
@@ -45,8 +45,8 @@ public void ApplyTo(SettingsManager settings)
Mode = GetSelectedVoiceMode(),
Enabled = GetSelectedVoiceMode() != VoiceActivationMode.Off,
ShowConversationToasts = VoiceConversationToastsCheckBox.IsChecked ?? false,
- SpeechToTextProviderId = (VoiceSpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
- TextToSpeechProviderId = (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows,
+ SpeechToTextProviderId = (VoiceSpeechToTextProviderComboBox.SelectedItem as VoiceProviderOption)?.Id ?? VoiceProviderIds.Windows,
+ TextToSpeechProviderId = (VoiceTextToSpeechProviderComboBox.SelectedItem as VoiceProviderOption)?.Id ?? VoiceProviderIds.Windows,
InputDeviceId = (VoiceInputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
OutputDeviceId = (VoiceOutputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
SampleRateHz = settings.Voice.SampleRateHz,
@@ -69,7 +69,7 @@ public void ApplyTo(SettingsManager settings)
ChatWindowSubmitMode = GetSelectedChatWindowSubmitMode()
}
};
- settings.VoiceProviderCredentials = Clone(_voiceProviderCredentialsDraft);
+ settings.VoiceProviderConfiguration = _voiceProviderConfigurationDraft.Clone();
}
private void LoadVoiceSettings()
@@ -79,7 +79,7 @@ private void LoadVoiceSettings()
return;
}
- _voiceProviderCredentialsDraft = Clone(_settings.VoiceProviderCredentials);
+ _voiceProviderConfigurationDraft = _settings.VoiceProviderConfiguration.Clone();
LoadVoiceProviders();
SelectVoiceMode(_settings.Voice.Mode);
SelectChatWindowSubmitMode(_settings.Voice.AlwaysOn.ChatWindowSubmitMode);
@@ -93,10 +93,10 @@ private void LoadVoiceProviders()
var catalog = _voiceService!.GetProviderCatalog();
_speechToTextOptions = catalog.SpeechToTextProviders
- .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
+ .Select(Clone)
.ToList();
_textToSpeechOptions = catalog.TextToSpeechProviders
- .Select(p => new ProviderOption(p.Id, p.Name, p.Runtime, p.Description))
+ .Select(Clone)
.ToList();
VoiceSpeechToTextProviderComboBox.ItemsSource = _speechToTextOptions;
@@ -210,13 +210,13 @@ private VoiceChatWindowSubmitMode GetSelectedChatWindowSubmitMode()
private void UpdateVoiceSettingsInfo()
{
- var stt = (VoiceSpeechToTextProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Recognition";
- var tts = (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Windows Speech Synthesis";
+ var stt = (VoiceSpeechToTextProviderComboBox.SelectedItem as VoiceProviderOption)?.Name ?? "Windows Speech Recognition";
+ var tts = (VoiceTextToSpeechProviderComboBox.SelectedItem as VoiceProviderOption)?.Name ?? "Windows Speech Synthesis";
var input = (VoiceInputDeviceComboBox.SelectedItem as DeviceOption)?.Name ?? "System default microphone";
var output = (VoiceOutputDeviceComboBox.SelectedItem as DeviceOption)?.Name ?? "System default speaker";
var fallbackNotice = string.Empty;
- if (VoiceTextToSpeechProviderComboBox.SelectedItem is ProviderOption ttsOption &&
+ if (VoiceTextToSpeechProviderComboBox.SelectedItem is VoiceProviderOption ttsOption &&
!VoiceProviderCatalogService.SupportsTextToSpeechRuntime(ttsOption.Id))
{
fallbackNotice = " Unsupported TTS providers will fall back to Windows until their runtime adapters are added.";
@@ -238,13 +238,28 @@ private void UpdateVoiceProviderSettingsEditor()
return;
}
+ var provider = GetSelectedTextToSpeechProvider();
+ var apiKeySetting = FindSetting(provider, VoiceProviderSettingKeys.ApiKey);
+ var modelSetting = FindSetting(provider, VoiceProviderSettingKeys.Model);
+ var voiceIdSetting = FindSetting(provider, VoiceProviderSettingKeys.VoiceId);
+
_updatingVoiceProviderFields = true;
try
{
VoiceTtsProviderSettingsTitleTextBlock.Text = $"{GetSelectedTextToSpeechProviderName().ToUpperInvariant()} SETTINGS";
- VoiceTtsApiKeyPasswordBox.Password = GetProviderApiKey(providerId) ?? string.Empty;
- VoiceTtsModelTextBox.Text = GetProviderModel(providerId);
- VoiceTtsVoiceIdTextBox.Text = GetProviderVoiceId(providerId);
+ VoiceTtsApiKeyPasswordBox.Header = apiKeySetting?.Label ?? "API key";
+ VoiceTtsApiKeyPasswordBox.Visibility = apiKeySetting != null ? Visibility.Visible : Visibility.Collapsed;
+ VoiceTtsApiKeyPasswordBox.Password = GetProviderValue(providerId, apiKeySetting) ?? string.Empty;
+
+ VoiceTtsModelTextBox.Header = modelSetting?.Label ?? "Model";
+ VoiceTtsModelTextBox.PlaceholderText = modelSetting?.Placeholder ?? string.Empty;
+ VoiceTtsModelTextBox.Visibility = modelSetting != null ? Visibility.Visible : Visibility.Collapsed;
+ VoiceTtsModelTextBox.Text = GetProviderValue(providerId, modelSetting) ?? string.Empty;
+
+ VoiceTtsVoiceIdTextBox.Header = voiceIdSetting?.Label ?? "Voice ID";
+ VoiceTtsVoiceIdTextBox.PlaceholderText = voiceIdSetting?.Placeholder ?? string.Empty;
+ VoiceTtsVoiceIdTextBox.Visibility = voiceIdSetting != null ? Visibility.Visible : Visibility.Collapsed;
+ VoiceTtsVoiceIdTextBox.Text = GetProviderValue(providerId, voiceIdSetting) ?? string.Empty;
_activeTtsProviderId = providerId;
}
finally
@@ -255,12 +270,17 @@ private void UpdateVoiceProviderSettingsEditor()
private string GetSelectedTextToSpeechProviderId()
{
- return (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Id ?? VoiceProviderIds.Windows;
+ return (VoiceTextToSpeechProviderComboBox.SelectedItem as VoiceProviderOption)?.Id ?? VoiceProviderIds.Windows;
}
private string GetSelectedTextToSpeechProviderName()
{
- return (VoiceTextToSpeechProviderComboBox.SelectedItem as ProviderOption)?.Name ?? "Provider";
+ return (VoiceTextToSpeechProviderComboBox.SelectedItem as VoiceProviderOption)?.Name ?? "Provider";
+ }
+
+ private VoiceProviderOption? GetSelectedTextToSpeechProvider()
+ {
+ return VoiceTextToSpeechProviderComboBox.SelectedItem as VoiceProviderOption;
}
private void CaptureSelectedVoiceProviderSettings()
@@ -276,139 +296,107 @@ private void CaptureSelectedVoiceProviderSettings()
return;
}
- SetProviderApiKey(providerId, VoiceTtsApiKeyPasswordBox.Password);
- SetProviderModel(providerId, VoiceTtsModelTextBox.Text);
- SetProviderVoiceId(providerId, VoiceTtsVoiceIdTextBox.Text);
+ var provider = _textToSpeechOptions.FirstOrDefault(option =>
+ string.Equals(option.Id, providerId, StringComparison.OrdinalIgnoreCase));
+ SetProviderValue(providerId, FindSetting(provider, VoiceProviderSettingKeys.ApiKey), VoiceTtsApiKeyPasswordBox.Password);
+ SetProviderValue(providerId, FindSetting(provider, VoiceProviderSettingKeys.Model), VoiceTtsModelTextBox.Text);
+ SetProviderValue(providerId, FindSetting(provider, VoiceProviderSettingKeys.VoiceId), VoiceTtsVoiceIdTextBox.Text);
}
- private string? GetProviderApiKey(string providerId)
+ private async void OnRefreshVoiceDevices(object sender, RoutedEventArgs e)
{
- return providerId switch
- {
- VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxApiKey,
- VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsApiKey,
- _ => null
- };
+ await LoadVoiceDevicesAsync();
}
- private string GetProviderModel(string providerId)
+ private void OnVoiceModeChanged(object sender, SelectionChangedEventArgs e)
{
- return providerId switch
- {
- VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxModel,
- VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsModel ?? string.Empty,
- _ => string.Empty
- };
+ UpdateVoiceSettingsInfo();
}
- private string GetProviderVoiceId(string providerId)
+ private void OnVoiceProviderChanged(object sender, SelectionChangedEventArgs e)
{
- return providerId switch
- {
- VoiceProviderIds.MiniMax => _voiceProviderCredentialsDraft.MiniMaxVoiceId,
- VoiceProviderIds.ElevenLabs => _voiceProviderCredentialsDraft.ElevenLabsVoiceId ?? string.Empty,
- _ => string.Empty
- };
+ CaptureSelectedVoiceProviderSettings();
+ UpdateVoiceProviderSettingsEditor();
+ UpdateVoiceSettingsInfo();
}
- private void SetProviderApiKey(string providerId, string? value)
+ private void OnVoiceProviderSettingsChanged(object sender, RoutedEventArgs e)
{
- var normalized = string.IsNullOrWhiteSpace(value) ? null : value.Trim();
-
- switch (providerId)
- {
- case VoiceProviderIds.MiniMax:
- _voiceProviderCredentialsDraft.MiniMaxApiKey = normalized;
- break;
- case VoiceProviderIds.ElevenLabs:
- _voiceProviderCredentialsDraft.ElevenLabsApiKey = normalized;
- break;
- }
+ CaptureSelectedVoiceProviderSettings();
}
- private void SetProviderModel(string providerId, string? value)
+ private string? GetProviderValue(string providerId, VoiceProviderSettingDefinition? setting)
{
- var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultModel(providerId) : value.Trim();
-
- switch (providerId)
+ if (setting == null)
{
- case VoiceProviderIds.MiniMax:
- _voiceProviderCredentialsDraft.MiniMaxModel = normalized;
- break;
- case VoiceProviderIds.ElevenLabs:
- _voiceProviderCredentialsDraft.ElevenLabsModel = normalized;
- break;
+ return null;
}
- }
- private void SetProviderVoiceId(string providerId, string? value)
- {
- var normalized = string.IsNullOrWhiteSpace(value) ? GetDefaultVoiceId(providerId) : value.Trim();
-
- switch (providerId)
- {
- case VoiceProviderIds.MiniMax:
- _voiceProviderCredentialsDraft.MiniMaxVoiceId = normalized;
- break;
- case VoiceProviderIds.ElevenLabs:
- _voiceProviderCredentialsDraft.ElevenLabsVoiceId = normalized;
- break;
- }
+ return _voiceProviderConfigurationDraft.GetValue(providerId, setting.Key) ?? setting.DefaultValue;
}
- private static string GetDefaultModel(string providerId)
- {
- return providerId switch
- {
- VoiceProviderIds.MiniMax => "speech-2.8-turbo",
- _ => string.Empty
- };
- }
+ private sealed record DeviceOption(string? DeviceId, string Name);
- private static string GetDefaultVoiceId(string providerId)
+ private void SetProviderValue(
+ string providerId,
+ VoiceProviderSettingDefinition? setting,
+ string? value)
{
- return providerId switch
+ if (setting == null)
{
- VoiceProviderIds.MiniMax => "English_MatureBoss",
- _ => string.Empty
- };
- }
-
- private async void OnRefreshVoiceDevices(object sender, RoutedEventArgs e)
- {
- await LoadVoiceDevicesAsync();
- }
-
- private void OnVoiceModeChanged(object sender, SelectionChangedEventArgs e)
- {
- UpdateVoiceSettingsInfo();
- }
+ return;
+ }
- private void OnVoiceProviderChanged(object sender, SelectionChangedEventArgs e)
- {
- CaptureSelectedVoiceProviderSettings();
- UpdateVoiceProviderSettingsEditor();
- UpdateVoiceSettingsInfo();
+ var normalized = string.IsNullOrWhiteSpace(value)
+ ? setting.DefaultValue
+ : value.Trim();
+ _voiceProviderConfigurationDraft.SetValue(providerId, setting.Key, normalized);
}
- private void OnVoiceProviderSettingsChanged(object sender, RoutedEventArgs e)
+ private static VoiceProviderSettingDefinition? FindSetting(VoiceProviderOption? provider, string settingKey)
{
- CaptureSelectedVoiceProviderSettings();
+ return provider?.Settings.FirstOrDefault(setting =>
+ string.Equals(setting.Key, settingKey, StringComparison.OrdinalIgnoreCase));
}
- private static VoiceProviderCredentials Clone(VoiceProviderCredentials source)
+ private static VoiceProviderOption Clone(VoiceProviderOption source)
{
- return new VoiceProviderCredentials
+ return new VoiceProviderOption
{
- MiniMaxApiKey = source.MiniMaxApiKey,
- MiniMaxModel = source.MiniMaxModel,
- MiniMaxVoiceId = source.MiniMaxVoiceId,
- ElevenLabsApiKey = source.ElevenLabsApiKey,
- ElevenLabsModel = source.ElevenLabsModel,
- ElevenLabsVoiceId = source.ElevenLabsVoiceId
+ Id = source.Id,
+ Name = source.Name,
+ Runtime = source.Runtime,
+ Enabled = source.Enabled,
+ Description = source.Description,
+ Settings = source.Settings
+ .Select(setting => new VoiceProviderSettingDefinition
+ {
+ Key = setting.Key,
+ Label = setting.Label,
+ Secret = setting.Secret,
+ DefaultValue = setting.DefaultValue,
+ Placeholder = setting.Placeholder,
+ Description = setting.Description
+ })
+ .ToList(),
+ TextToSpeechHttp = source.TextToSpeechHttp == null
+ ? null
+ : new VoiceTextToSpeechHttpContract
+ {
+ EndpointTemplate = source.TextToSpeechHttp.EndpointTemplate,
+ HttpMethod = source.TextToSpeechHttp.HttpMethod,
+ AuthenticationHeaderName = source.TextToSpeechHttp.AuthenticationHeaderName,
+ AuthenticationScheme = source.TextToSpeechHttp.AuthenticationScheme,
+ ApiKeySettingKey = source.TextToSpeechHttp.ApiKeySettingKey,
+ RequestContentType = source.TextToSpeechHttp.RequestContentType,
+ RequestBodyTemplate = source.TextToSpeechHttp.RequestBodyTemplate,
+ ResponseAudioMode = source.TextToSpeechHttp.ResponseAudioMode,
+ ResponseAudioJsonPath = source.TextToSpeechHttp.ResponseAudioJsonPath,
+ ResponseStatusCodeJsonPath = source.TextToSpeechHttp.ResponseStatusCodeJsonPath,
+ ResponseStatusMessageJsonPath = source.TextToSpeechHttp.ResponseStatusMessageJsonPath,
+ SuccessStatusValue = source.TextToSpeechHttp.SuccessStatusValue,
+ OutputContentType = source.TextToSpeechHttp.OutputContentType
+ }
};
}
-
- private sealed record DeviceOption(string? DeviceId, string Name);
- private sealed record ProviderOption(string Id, string Name, string Runtime, string? Description);
}
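
Reviewer note on GetProviderValue / SetProviderValue above: both helpers apply the same rule, where a blank value falls back to the setting's DefaultValue and anything else is trimmed. The sketch below illustrates only that rule. The tuple-keyed dictionary, the "minimax" id string, and the Normalize / SetValue / GetValue names are placeholders invented for illustration; the real VoiceProviderConfigurationStore is defined elsewhere in the patch and may behave differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the settings draft (assumption, not the patch's type).
var draft = new Dictionary<(string ProviderId, string Key), string?>();

// Blank input collapses to the setting's default; anything else is trimmed.
static string? Normalize(string? value, string? defaultValue) =>
    string.IsNullOrWhiteSpace(value) ? defaultValue : value.Trim();

void SetValue(string providerId, string key, string? value, string? defaultValue) =>
    draft[(providerId, key)] = Normalize(value, defaultValue);

// Reads fall back to the default when nothing (or null) is stored.
string? GetValue(string providerId, string key, string? defaultValue) =>
    draft.TryGetValue((providerId, key), out var stored) ? stored ?? defaultValue : defaultValue;

SetValue("minimax", "model", "   ", "speech-2.8-turbo");                 // blank, so the default is stored
SetValue("minimax", "voiceId", "  Custom_Voice ", "English_MatureBoss"); // trimmed
string? model = GetValue("minimax", "model", "speech-2.8-turbo");
string? voiceId = GetValue("minimax", "voiceId", "English_MatureBoss");
string? apiKey = GetValue("minimax", "apiKey", null);                    // never set, no default

Console.WriteLine($"{model} / {voiceId} / {apiKey ?? "(unset)"}");
```

A setting with no default, such as the ElevenLabs voice ID, stays null under this rule until the user supplies a value.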
diff --git a/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs b/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
index 9db01d9..0320136 100644
--- a/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
@@ -43,7 +43,7 @@ public class SettingsManager
public bool PreferStructuredCategories { get; set; } = true;
public List<OpenClaw.Shared.UserNotificationRule> UserRules { get; set; } = new();
public VoiceSettings Voice { get; set; } = new();
- public VoiceProviderCredentials VoiceProviderCredentials { get; set; } = new();
+ public VoiceProviderConfigurationStore VoiceProviderConfiguration { get; set; } = new();
// Node mode (enables Windows as a node, not just operator)
public bool EnableNodeMode { get; set; } = false;
@@ -85,7 +85,8 @@ public void Load()
if (loaded.UserRules != null)
UserRules = loaded.UserRules;
Voice = loaded.Voice ?? new VoiceSettings();
- VoiceProviderCredentials = loaded.VoiceProviderCredentials ?? new VoiceProviderCredentials();
+ VoiceProviderConfiguration = loaded.VoiceProviderConfiguration?.Clone() ?? new VoiceProviderConfigurationStore();
+ VoiceProviderConfiguration.MigrateLegacyCredentials(loaded.VoiceProviderCredentials);
}
}
}
@@ -123,7 +124,7 @@ public void Save()
PreferStructuredCategories = PreferStructuredCategories,
UserRules = UserRules,
Voice = Voice,
- VoiceProviderCredentials = VoiceProviderCredentials
+ VoiceProviderConfiguration = VoiceProviderConfiguration.Clone()
};
var json = data.ToJson();
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
new file mode 100644
index 0000000..422d952
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
@@ -0,0 +1,296 @@
+using System;
+using System.Collections.Generic;
+using System.Net.Http;
+using System.Net.Http.Headers;
+using System.Runtime.InteropServices.WindowsRuntime;
+using System.Text;
+using System.Text.Json;
+using System.Threading.Tasks;
+using OpenClaw.Shared;
+using Windows.Storage.Streams;
+
+namespace OpenClawTray.Services.Voice;
+
+public sealed class VoiceCloudTextToSpeechClient
+{
+ private static readonly HttpClient s_httpClient = CreateHttpClient();
+
+ public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
+ string text,
+ VoiceProviderOption provider,
+ VoiceProviderConfigurationStore configurationStore)
+ {
+ ArgumentException.ThrowIfNullOrWhiteSpace(text);
+ ArgumentNullException.ThrowIfNull(provider);
+ ArgumentNullException.ThrowIfNull(configurationStore);
+
+ var contract = provider.TextToSpeechHttp
+ ?? throw new InvalidOperationException($"TTS provider '{provider.Name}' does not expose an HTTP contract.");
+ var providerConfiguration = configurationStore.FindProvider(provider.Id);
+ var templateValues = BuildTemplateValues(text, provider, providerConfiguration, contract);
+ var endpoint = ApplyUrlTemplate(contract.EndpointTemplate, templateValues);
+ using var request = new HttpRequestMessage(ParseHttpMethod(contract.HttpMethod), endpoint);
+ ApplyAuthenticationHeader(request, contract, templateValues);
+
+ if (!string.IsNullOrWhiteSpace(contract.RequestBodyTemplate))
+ {
+ var requestBody = ApplyJsonTemplate(contract.RequestBodyTemplate, templateValues);
+ request.Content = new StringContent(
+ requestBody,
+ Encoding.UTF8,
+ string.IsNullOrWhiteSpace(contract.RequestContentType) ? "application/json" : contract.RequestContentType);
+ }
+
+ using var response = await s_httpClient.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
+ if (!response.IsSuccessStatusCode)
+ {
+ throw new InvalidOperationException(
+ $"{provider.Name} TTS request failed: {(int)response.StatusCode} {response.ReasonPhrase}");
+ }
+
+ if (string.Equals(contract.ResponseAudioMode, VoiceTextToSpeechResponseModes.Binary, StringComparison.OrdinalIgnoreCase))
+ {
+ var audioBytes = await response.Content.ReadAsByteArrayAsync();
+ return await CreateResultAsync(audioBytes, contract.OutputContentType);
+ }
+
+ var responseText = await response.Content.ReadAsStringAsync();
+ using var document = JsonDocument.Parse(responseText);
+ ValidateResponseStatus(provider, contract, document.RootElement);
+
+ var audioString = GetRequiredJsonString(document.RootElement, contract.ResponseAudioJsonPath);
+ var audioBytesFromJson = DecodeAudioBytes(contract.ResponseAudioMode, audioString, provider.Name);
+ return await CreateResultAsync(audioBytesFromJson, contract.OutputContentType);
+ }
+
+ private static Dictionary<string, string> BuildTemplateValues(
+ string text,
+ VoiceProviderOption provider,
+ VoiceProviderConfiguration? providerConfiguration,
+ VoiceTextToSpeechHttpContract contract)
+ {
+ var values = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
+ {
+ ["text"] = text
+ };
+
+ foreach (var setting in provider.Settings)
+ {
+ var configuredValue = providerConfiguration?.GetValue(setting.Key);
+ var effectiveValue = string.IsNullOrWhiteSpace(configuredValue)
+ ? setting.DefaultValue
+ : configuredValue.Trim();
+
+ if (string.IsNullOrWhiteSpace(effectiveValue))
+ {
+ if (setting.Secret || string.Equals(setting.Key, contract.ApiKeySettingKey, StringComparison.OrdinalIgnoreCase))
+ {
+ throw new InvalidOperationException(
+ $"{provider.Name} API key is not configured. Open Settings and complete the {provider.Name} voice provider fields.");
+ }
+
+ throw new InvalidOperationException(
+ $"{provider.Name} setting '{setting.Label}' is required. Open Settings and complete the {provider.Name} voice provider fields.");
+ }
+
+ values[setting.Key] = effectiveValue;
+ }
+
+ return values;
+ }
+
+ private static string ApplyUrlTemplate(string template, IReadOnlyDictionary<string, string> values)
+ {
+ var result = template;
+ foreach (var entry in values)
+ {
+ result = result.Replace(
+ "{{" + entry.Key + "}}",
+ Uri.EscapeDataString(entry.Value),
+ StringComparison.Ordinal);
+ }
+
+ return result;
+ }
+
+ private static string ApplyJsonTemplate(string template, IReadOnlyDictionary<string, string> values)
+ {
+ var result = template;
+ foreach (var entry in values)
+ {
+ result = result.Replace(
+ "{{" + entry.Key + "}}",
+ JsonSerializer.Serialize(entry.Value),
+ StringComparison.Ordinal);
+ }
+
+ return result;
+ }
+
+ private static void ApplyAuthenticationHeader(
+ HttpRequestMessage request,
+ VoiceTextToSpeechHttpContract contract,
+ IReadOnlyDictionary<string, string> values)
+ {
+ if (!values.TryGetValue(contract.ApiKeySettingKey, out var apiKey) || string.IsNullOrWhiteSpace(apiKey))
+ {
+ throw new InvalidOperationException("Voice provider API key is not configured.");
+ }
+
+ if (string.Equals(contract.AuthenticationHeaderName, "Authorization", StringComparison.OrdinalIgnoreCase) &&
+ !string.IsNullOrWhiteSpace(contract.AuthenticationScheme))
+ {
+ request.Headers.Authorization = new AuthenticationHeaderValue(contract.AuthenticationScheme, apiKey);
+ return;
+ }
+
+ var headerValue = string.IsNullOrWhiteSpace(contract.AuthenticationScheme)
+ ? apiKey
+ : $"{contract.AuthenticationScheme} {apiKey}";
+ request.Headers.TryAddWithoutValidation(contract.AuthenticationHeaderName, headerValue);
+ }
+
+ private static HttpMethod ParseHttpMethod(string? method)
+ {
+ if (string.Equals(method, HttpMethod.Post.Method, StringComparison.OrdinalIgnoreCase))
+ {
+ return HttpMethod.Post;
+ }
+
+ return new HttpMethod(string.IsNullOrWhiteSpace(method) ? HttpMethod.Post.Method : method);
+ }
+
+ private static void ValidateResponseStatus(
+ VoiceProviderOption provider,
+ VoiceTextToSpeechHttpContract contract,
+ JsonElement root)
+ {
+ if (string.IsNullOrWhiteSpace(contract.ResponseStatusCodeJsonPath))
+ {
+ return;
+ }
+
+ var statusValue = GetJsonValue(root, contract.ResponseStatusCodeJsonPath);
+ var statusText = statusValue.HasValue ? JsonElementToString(statusValue.Value) : null;
+ var successValue = contract.SuccessStatusValue ?? "0";
+ if (string.Equals(statusText, successValue, StringComparison.OrdinalIgnoreCase))
+ {
+ return;
+ }
+
+ // GetJsonValue returns null for a blank path, so a single lookup covers both cases.
+ var statusMessageElement = GetJsonValue(root, contract.ResponseStatusMessageJsonPath);
+ var statusMessage = statusMessageElement.HasValue
+ ? JsonElementToString(statusMessageElement.Value)
+ : null;
+ throw new InvalidOperationException(
+ string.IsNullOrWhiteSpace(statusMessage)
+ ? $"{provider.Name} TTS returned an error."
+ : $"{provider.Name} TTS returned an error: {statusMessage}");
+ }
+
+ private static JsonElement? GetJsonValue(JsonElement root, string? path)
+ {
+ if (string.IsNullOrWhiteSpace(path))
+ {
+ return null;
+ }
+
+ var current = root;
+ foreach (var segment in path.Split('.', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries))
+ {
+ if (current.ValueKind != JsonValueKind.Object || !current.TryGetProperty(segment, out current))
+ {
+ return null;
+ }
+ }
+
+ return current;
+ }
+
+ private static string GetRequiredJsonString(JsonElement root, string? path)
+ {
+ var value = GetJsonValue(root, path);
+ if (!value.HasValue)
+ {
+ throw new InvalidOperationException("Voice provider response did not contain audio data.");
+ }
+
+ var text = value.Value.ValueKind == JsonValueKind.String ? value.Value.GetString() : null;
+ if (string.IsNullOrWhiteSpace(text))
+ {
+ throw new InvalidOperationException("Voice provider response did not contain audio data.");
+ }
+
+ return text;
+ }
+
+ private static string? JsonElementToString(JsonElement element)
+ {
+ return element.ValueKind switch
+ {
+ JsonValueKind.String => element.GetString(),
+ JsonValueKind.Number => element.ToString(),
+ JsonValueKind.True => bool.TrueString,
+ JsonValueKind.False => bool.FalseString,
+ _ => element.ToString()
+ };
+ }
+
+ private static byte[] DecodeAudioBytes(string responseAudioMode, string audioValue, string providerName)
+ {
+ try
+ {
+ if (string.Equals(responseAudioMode, VoiceTextToSpeechResponseModes.HexJsonString, StringComparison.OrdinalIgnoreCase))
+ {
+ return Convert.FromHexString(audioValue);
+ }
+
+ if (string.Equals(responseAudioMode, VoiceTextToSpeechResponseModes.Base64JsonString, StringComparison.OrdinalIgnoreCase))
+ {
+ return Convert.FromBase64String(audioValue);
+ }
+
+ throw new InvalidOperationException($"Unsupported TTS response mode '{responseAudioMode}'.");
+ }
+ catch (FormatException ex)
+ {
+ throw new InvalidOperationException($"{providerName} TTS returned invalid audio data.", ex);
+ }
+ }
+
+ private static async Task<VoiceCloudTextToSpeechResult> CreateResultAsync(byte[] audioBytes, string contentType)
+ {
+ var stream = new InMemoryRandomAccessStream();
+ await stream.WriteAsync(audioBytes.AsBuffer());
+ await stream.FlushAsync();
+ stream.Seek(0);
+
+ return new VoiceCloudTextToSpeechResult(stream, string.IsNullOrWhiteSpace(contentType) ? "audio/mpeg" : contentType);
+ }
+
+ private static HttpClient CreateHttpClient()
+ {
+ return new HttpClient
+ {
+ Timeout = TimeSpan.FromSeconds(30)
+ };
+ }
+}
+
+public sealed class VoiceCloudTextToSpeechResult : IDisposable
+{
+ public VoiceCloudTextToSpeechResult(IRandomAccessStream stream, string contentType)
+ {
+ Stream = stream;
+ ContentType = contentType;
+ }
+
+ public IRandomAccessStream Stream { get; }
+ public string ContentType { get; }
+
+ public void Dispose()
+ {
+ Stream.Dispose();
+ }
+}
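
Reviewer note on ApplyUrlTemplate / ApplyJsonTemplate above: both helpers loop over the value dictionary and call string.Replace once per key, so text substituted by an earlier iteration is scanned again by later ones. If the spoken text happens to contain something like {{model}}, a later iteration will expand it inside the already-quoted JSON string. One possible hardening, sketched here under assumptions, is to substitute in a single pass. ExpandOnce is a hypothetical helper that is not part of the patch, the example.invalid URL is a placeholder, and the placeholder pattern (any run of non-brace characters between double braces) is an assumption about what setting keys look like.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical helper (not in the patch): expands {{key}} placeholders in one pass,
// so placeholder-like text inside a substituted value is never expanded again.
static string ExpandOnce(
    string template,
    IReadOnlyDictionary<string, string> values,
    Func<string, string> encode)
{
    return Regex.Replace(
        template,
        @"\{\{([^{}]+)\}\}",
        match => values.TryGetValue(match.Groups[1].Value, out var value)
            ? encode(value)
            : match.Value); // unknown placeholders are left untouched
}

var values = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["text"] = "Say \"hi\" to {{model}}",
    ["model"] = "speech-2.8-turbo",
    ["voiceId"] = "voice 42/a"
};

// JSON bodies: each value becomes a quoted, escaped JSON string literal.
string body = ExpandOnce(
    "{\"text\": {{text}}, \"model_id\": {{model}}}",
    values,
    value => JsonSerializer.Serialize(value));

// URLs: each value is percent-encoded instead.
string url = ExpandOnce(
    "https://example.invalid/tts/{{voiceId}}",
    values,
    Uri.EscapeDataString);

Console.WriteLine(body);
Console.WriteLine(url);
```

Because the dictionary here uses a case-insensitive comparer, this sketch also matches {{Model}}, which the Ordinal string.Replace in the patch would not; whether that is desirable is a design choice for the author.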
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
index c6132a3..7e3136d 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
@@ -81,14 +81,15 @@ public static bool SupportsWindowsRuntime(string? providerId)
return string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase);
}
- public static bool SupportsMiniMaxTextToSpeech(string? providerId)
- {
- return string.Equals(providerId, VoiceProviderIds.MiniMax, StringComparison.OrdinalIgnoreCase);
- }
-
public static bool SupportsTextToSpeechRuntime(string? providerId)
{
- return SupportsWindowsRuntime(providerId) || SupportsMiniMaxTextToSpeech(providerId);
+ if (SupportsWindowsRuntime(providerId))
+ {
+ return true;
+ }
+
+ var provider = ResolveTextToSpeechProvider(providerId);
+ return provider.TextToSpeechHttp != null;
}
private static VoiceProviderCatalog CreateBuiltInCatalog()
@@ -119,14 +120,107 @@ private static VoiceProviderCatalog CreateBuiltInCatalog()
Id = VoiceProviderIds.MiniMax,
Name = "MiniMax Speech 2.8 Turbo",
Runtime = "cloud",
- Description = "Cloud TTS using speech-2.8-turbo with English_MatureBoss."
+ Description = "Cloud TTS using MiniMax HTTP text-to-speech.",
+ Settings =
+ [
+ new VoiceProviderSettingDefinition
+ {
+ Key = VoiceProviderSettingKeys.ApiKey,
+ Label = "API key",
+ Secret = true
+ },
+ new VoiceProviderSettingDefinition
+ {
+ Key = VoiceProviderSettingKeys.Model,
+ Label = "Model",
+ DefaultValue = "speech-2.8-turbo"
+ },
+ new VoiceProviderSettingDefinition
+ {
+ Key = VoiceProviderSettingKeys.VoiceId,
+ Label = "Voice ID",
+ DefaultValue = "English_MatureBoss"
+ }
+ ],
+ TextToSpeechHttp = new VoiceTextToSpeechHttpContract
+ {
+ EndpointTemplate = "https://api.minimax.io/v1/t2a_v2",
+ AuthenticationHeaderName = "Authorization",
+ AuthenticationScheme = "Bearer",
+ ApiKeySettingKey = VoiceProviderSettingKeys.ApiKey,
+ RequestContentType = "application/json",
+ RequestBodyTemplate = """
+ {
+ "model": {{model}},
+ "text": {{text}},
+ "stream": false,
+ "language_boost": "English",
+ "output_format": "hex",
+ "voice_setting": {
+ "voice_id": {{voiceId}},
+ "speed": 1,
+ "vol": 1,
+ "pitch": 0
+ },
+ "audio_setting": {
+ "sample_rate": 32000,
+ "bitrate": 128000,
+ "format": "mp3",
+ "channel": 1
+ }
+ }
+ """,
+ ResponseAudioMode = VoiceTextToSpeechResponseModes.HexJsonString,
+ ResponseAudioJsonPath = "data.audio",
+ ResponseStatusCodeJsonPath = "base_resp.status_code",
+ ResponseStatusMessageJsonPath = "base_resp.status_msg",
+ SuccessStatusValue = "0",
+ OutputContentType = "audio/mpeg"
+ }
},
new VoiceProviderOption
{
Id = VoiceProviderIds.ElevenLabs,
Name = "ElevenLabs",
Runtime = "cloud",
- Description = "Cloud TTS provider planned for the next phase."
+ Description = "Cloud TTS using the ElevenLabs create speech API.",
+ Settings =
+ [
+ new VoiceProviderSettingDefinition
+ {
+ Key = VoiceProviderSettingKeys.ApiKey,
+ Label = "API key",
+ Secret = true
+ },
+ new VoiceProviderSettingDefinition
+ {
+ Key = VoiceProviderSettingKeys.Model,
+ Label = "Model",
+ DefaultValue = "eleven_multilingual_v2"
+ },
+ new VoiceProviderSettingDefinition
+ {
+ Key = VoiceProviderSettingKeys.VoiceId,
+ Label = "Voice ID",
+ Placeholder = "Enter an ElevenLabs voice ID"
+ }
+ ],
+ TextToSpeechHttp = new VoiceTextToSpeechHttpContract
+ {
+ EndpointTemplate = "https://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}?output_format=mp3_44100_128",
+ AuthenticationHeaderName = "xi-api-key",
+ AuthenticationScheme = null,
+ ApiKeySettingKey = VoiceProviderSettingKeys.ApiKey,
+ RequestContentType = "application/json",
+ RequestBodyTemplate = """
+ {
+ "text": {{text}},
+ "model_id": {{model}}
+ }
+ """,
+ ResponseAudioMode = VoiceTextToSpeechResponseModes.Binary,
+ OutputContentType = "audio/mpeg"
+ }
}
]
};
@@ -184,7 +278,47 @@ private static VoiceProviderOption Clone(VoiceProviderOption source)
Name = source.Name,
Runtime = source.Runtime,
Enabled = source.Enabled,
+ Description = source.Description,
+ Settings = source.Settings.Select(Clone).ToList(),
+ TextToSpeechHttp = Clone(source.TextToSpeechHttp)
+ };
+ }
+
+ private static VoiceProviderSettingDefinition Clone(VoiceProviderSettingDefinition source)
+ {
+ return new VoiceProviderSettingDefinition
+ {
+ Key = source.Key,
+ Label = source.Label,
+ Secret = source.Secret,
+ DefaultValue = source.DefaultValue,
+ Placeholder = source.Placeholder,
Description = source.Description
};
}
+
+ private static VoiceTextToSpeechHttpContract? Clone(VoiceTextToSpeechHttpContract? source)
+ {
+ if (source == null)
+ {
+ return null;
+ }
+
+ return new VoiceTextToSpeechHttpContract
+ {
+ EndpointTemplate = source.EndpointTemplate,
+ HttpMethod = source.HttpMethod,
+ AuthenticationHeaderName = source.AuthenticationHeaderName,
+ AuthenticationScheme = source.AuthenticationScheme,
+ ApiKeySettingKey = source.ApiKeySettingKey,
+ RequestContentType = source.RequestContentType,
+ RequestBodyTemplate = source.RequestBodyTemplate,
+ ResponseAudioMode = source.ResponseAudioMode,
+ ResponseAudioJsonPath = source.ResponseAudioJsonPath,
+ ResponseStatusCodeJsonPath = source.ResponseStatusCodeJsonPath,
+ ResponseStatusMessageJsonPath = source.ResponseStatusMessageJsonPath,
+ SuccessStatusValue = source.SuccessStatusValue,
+ OutputContentType = source.OutputContentType
+ };
+ }
}
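
Reviewer note on the MiniMax contract above: ResponseAudioJsonPath, ResponseStatusCodeJsonPath and ResponseStatusMessageJsonPath are dotted paths that the client resolves one property at a time, and HexJsonString means the audio arrives as a hex string inside the JSON body. The sketch below walks a made-up response in that shape. Lookup is an illustrative name that mirrors the shape of GetJsonValue in the client, and the sample payload is invented, not captured from the MiniMax API.

```csharp
using System;
using System.Text.Json;

// Walks a dotted path such as "base_resp.status_code"; returns null when any
// segment is missing or a non-object is reached before the path ends.
static JsonElement? Lookup(JsonElement root, string path)
{
    var current = root;
    foreach (var segment in path.Split('.', StringSplitOptions.RemoveEmptyEntries))
    {
        if (current.ValueKind != JsonValueKind.Object ||
            !current.TryGetProperty(segment, out current))
        {
            return null;
        }
    }

    return current;
}

// Invented sample: "494433" is the hex encoding of the three ASCII bytes "ID3".
string sample =
    "{\"data\": {\"audio\": \"494433\"}, " +
    "\"base_resp\": {\"status_code\": 0, \"status_msg\": \"success\"}}";

using var document = JsonDocument.Parse(sample);
var root = document.RootElement;

// A numeric status code is compared as text against SuccessStatusValue ("0").
string? status = Lookup(root, "base_resp.status_code")?.ToString();
string? audioHex = Lookup(root, "data.audio")?.GetString();
byte[] audio = Convert.FromHexString(audioHex ?? string.Empty);
bool missing = Lookup(root, "data.missing") is null;

Console.WriteLine($"status={status}, bytes={audio.Length}, missing={missing}");
```

The ElevenLabs contract skips this path entirely: with ResponseAudioMode set to Binary, the response body is read as raw bytes and no JSON paths are consulted.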
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index ebbad9b..8822916 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -3,9 +3,7 @@
using System.Globalization;
using System.Linq;
using System.Net.Http;
-using System.Net.Http.Headers;
using System.Runtime.InteropServices.WindowsRuntime;
-using System.Text;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
@@ -26,17 +24,14 @@ namespace OpenClawTray.Services.Voice;
public sealed class VoiceService : IVoiceRuntime, IDisposable
{
private const string DefaultSessionKey = "main";
- private const string MiniMaxTtsEndpoint = "https://api.minimax.io/v1/t2a_v2";
- private const string MiniMaxTtsModel = "speech-2.8-turbo";
- private const string MiniMaxTtsVoiceId = "English_MatureBoss";
private const int HResultSpeechPrivacyDeclined = unchecked((int)0x80045509);
private static readonly TimeSpan TransportConnectTimeout = TimeSpan.FromSeconds(10);
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromSeconds(2);
- private static readonly HttpClient s_httpClient = CreateHttpClient();
private readonly IOpenClawLogger _logger;
private readonly SettingsManager _settings;
+ private readonly VoiceCloudTextToSpeechClient _cloudTextToSpeechClient;
private readonly object _gate = new();
private VoiceStatusInfo _status;
@@ -65,6 +60,7 @@ public VoiceService(IOpenClawLogger logger, SettingsManager settings)
{
_logger = logger;
_settings = settings;
+ _cloudTextToSpeechClient = new VoiceCloudTextToSpeechClient();
_status = new VoiceStatusInfo();
_status = BuildStoppedStatus(null, null);
}
@@ -970,14 +966,14 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
private async Task SpeakTextAsync(string text)
{
VoiceSettings settings;
- VoiceProviderCredentials credentials;
+ VoiceProviderConfigurationStore providerConfiguration;
SpeechSynthesizer? synthesizer;
MediaPlayer? player;
lock (_gate)
{
settings = Clone(_settings.Voice);
- credentials = Clone(_settings.VoiceProviderCredentials);
+ providerConfiguration = _settings.VoiceProviderConfiguration.Clone();
synthesizer = _speechSynthesizer;
player = _mediaPlayer;
}
@@ -991,9 +987,10 @@ private async Task SpeakTextAsync(string text)
settings.TextToSpeechProviderId,
_logger);
- if (VoiceProviderCatalogService.SupportsMiniMaxTextToSpeech(provider.Id))
+ if (provider.TextToSpeechHttp != null)
{
- await SpeakWithMiniMaxAsync(text, credentials, player);
+ using var result = await _cloudTextToSpeechClient.SynthesizeAsync(text, provider, providerConfiguration);
+ await PlayStreamAsync(player, result.Stream, result.ContentType);
return;
}
@@ -1037,74 +1034,6 @@ private static async Task PlayStreamAsync(
}
}
- private async Task SpeakWithMiniMaxAsync(
- string text,
- VoiceProviderCredentials credentials,
- MediaPlayer player)
- {
- if (string.IsNullOrWhiteSpace(credentials.MiniMaxApiKey))
- {
- throw new InvalidOperationException(
- "MiniMax API key is not configured. Add VoiceProviderCredentials.MiniMaxApiKey to %APPDATA%\\OpenClawTray\\settings.json.");
- }
-
- if (string.IsNullOrWhiteSpace(text))
- {
- return;
- }
-
- var model = string.IsNullOrWhiteSpace(credentials.MiniMaxModel)
- ? MiniMaxTtsModel
- : credentials.MiniMaxModel.Trim();
- var voiceId = string.IsNullOrWhiteSpace(credentials.MiniMaxVoiceId)
- ? MiniMaxTtsVoiceId
- : credentials.MiniMaxVoiceId.Trim();
-
- var payload = BuildMiniMaxRequestPayload(text, model, voiceId);
- using var request = new HttpRequestMessage(HttpMethod.Post, MiniMaxTtsEndpoint)
- {
- Content = new StringContent(payload, Encoding.UTF8, "application/json")
- };
- request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", credentials.MiniMaxApiKey);
-
- using var response = await s_httpClient.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
- var responseText = await response.Content.ReadAsStringAsync();
-
- if (!response.IsSuccessStatusCode)
- {
- throw new InvalidOperationException($"MiniMax TTS request failed: {(int)response.StatusCode} {response.ReasonPhrase}");
- }
-
- using var document = JsonDocument.Parse(responseText);
- var statusCode = document.RootElement
- .GetProperty("base_resp")
- .GetProperty("status_code")
- .GetInt32();
- if (statusCode != 0)
- {
- var statusMessage = document.RootElement
- .GetProperty("base_resp")
- .GetProperty("status_msg")
- .GetString() ?? "unknown error";
- throw new InvalidOperationException($"MiniMax TTS returned an error: {statusMessage}");
- }
-
- var audioHex = document.RootElement
- .GetProperty("data")
- .GetProperty("audio")
- .GetString();
- if (string.IsNullOrWhiteSpace(audioHex))
- {
- throw new InvalidOperationException("MiniMax TTS response did not contain audio data.");
- }
-
- var audioBytes = DecodeHex(audioHex);
- using var stream = new InMemoryRandomAccessStream();
- await stream.WriteAsync(audioBytes.AsBuffer());
- await stream.FlushAsync();
- await PlayStreamAsync(player, stream, "audio/mpeg");
- }
-
private async void OnSpeechRecognitionCompleted(
SpeechContinuousRecognitionSession sender,
SpeechContinuousRecognitionCompletedEventArgs args)
@@ -1455,19 +1384,6 @@ private static VoiceStatusInfo Clone(VoiceStatusInfo source)
};
}
- private static VoiceProviderCredentials Clone(VoiceProviderCredentials source)
- {
- return new VoiceProviderCredentials
- {
- MiniMaxApiKey = source.MiniMaxApiKey,
- MiniMaxModel = source.MiniMaxModel,
- MiniMaxVoiceId = source.MiniMaxVoiceId,
- ElevenLabsApiKey = source.ElevenLabsApiKey,
- ElevenLabsModel = source.ElevenLabsModel,
- ElevenLabsVoiceId = source.ElevenLabsVoiceId
- };
- }
-
private static string? BuildProviderFallbackMessage(
VoiceProviderOption speechToTextProvider,
VoiceProviderOption textToSpeechProvider)
@@ -1487,54 +1403,6 @@ private static VoiceProviderCredentials Clone(VoiceProviderCredentials source)
return fallbacks.Count == 0 ? null : string.Join(" ", fallbacks);
}
- private static HttpClient CreateHttpClient()
- {
- return new HttpClient
- {
- Timeout = TimeSpan.FromSeconds(30)
- };
- }
-
- private static string BuildMiniMaxRequestPayload(string text, string model, string voiceId)
- {
- var payload = new
- {
- model,
- text,
- stream = false,
- language_boost = "English",
- output_format = "hex",
- voice_setting = new
- {
- voice_id = voiceId,
- speed = 1,
- vol = 1,
- pitch = 0
- },
- audio_setting = new
- {
- sample_rate = 32000,
- bitrate = 128000,
- format = "mp3",
- channel = 1
- }
- };
-
- return JsonSerializer.Serialize(payload);
- }
-
- private static byte[] DecodeHex(string hex)
- {
- try
- {
- return Convert.FromHexString(hex);
- }
- catch (FormatException ex)
- {
- throw new InvalidOperationException("MiniMax TTS returned invalid audio data.", ex);
- }
- }
-
private static string GetUserFacingErrorMessage(Exception ex)
{
if (IsSpeechPrivacyDeclined(ex))
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index 4ec1308..e566f46 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -85,15 +85,32 @@ public void VoiceProviderIds_ExposeRequiredBuiltInProviders()
}
[Fact]
- public void VoiceProviderCredentials_Defaults_ToEmptySecrets()
+ public void VoiceProviderConfigurationStore_Defaults_ToEmptyProviders()
{
- var credentials = new VoiceProviderCredentials();
+ var configuration = new VoiceProviderConfigurationStore();
- Assert.Null(credentials.MiniMaxApiKey);
- Assert.Equal("speech-2.8-turbo", credentials.MiniMaxModel);
- Assert.Equal("English_MatureBoss", credentials.MiniMaxVoiceId);
- Assert.Null(credentials.ElevenLabsApiKey);
- Assert.Null(credentials.ElevenLabsModel);
- Assert.Null(credentials.ElevenLabsVoiceId);
+ Assert.Empty(configuration.Providers);
+ }
+
+ [Fact]
+ public void VoiceProviderConfigurationStore_MigratesLegacyProviderCredentials()
+ {
+ var configuration = new VoiceProviderConfigurationStore();
+ configuration.MigrateLegacyCredentials(new VoiceProviderCredentials
+ {
+ MiniMaxApiKey = "minimax-key",
+ MiniMaxModel = "speech-2.8-turbo",
+ MiniMaxVoiceId = "English_MatureBoss",
+ ElevenLabsApiKey = "eleven-key",
+ ElevenLabsModel = "eleven_multilingual_v2",
+ ElevenLabsVoiceId = "voice-42"
+ });
+
+ Assert.Equal("minimax-key", configuration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.ApiKey));
+ Assert.Equal("speech-2.8-turbo", configuration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.Model));
+ Assert.Equal("English_MatureBoss", configuration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.VoiceId));
+ Assert.Equal("eleven-key", configuration.GetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.ApiKey));
+ Assert.Equal("eleven_multilingual_v2", configuration.GetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.Model));
+ Assert.Equal("voice-42", configuration.GetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.VoiceId));
}
}
diff --git a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
index 1cc3f0d..65062fa 100644
--- a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
+++ b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
@@ -1,3 +1,4 @@
+using System.Collections.Generic;
using System.Text.Json;
using OpenClaw.Shared;
@@ -57,14 +58,31 @@ public void RoundTrip_AllFields_Preserved()
ChatWindowSubmitMode = VoiceChatWindowSubmitMode.WaitForUser
}
},
- VoiceProviderCredentials = new VoiceProviderCredentials
+ VoiceProviderConfiguration = new VoiceProviderConfigurationStore
{
- MiniMaxApiKey = "minimax-key",
- MiniMaxModel = "speech-2.8-turbo",
- MiniMaxVoiceId = "English_MatureBoss",
- ElevenLabsApiKey = "eleven-key",
- ElevenLabsModel = "eleven-v3",
- ElevenLabsVoiceId = "voice-42"
+ Providers =
+ [
+ new VoiceProviderConfiguration
+ {
+ ProviderId = VoiceProviderIds.MiniMax,
+ Values = new Dictionary<string, string>
+ {
+ [VoiceProviderSettingKeys.ApiKey] = "minimax-key",
+ [VoiceProviderSettingKeys.Model] = "speech-2.8-turbo",
+ [VoiceProviderSettingKeys.VoiceId] = "English_MatureBoss"
+ }
+ },
+ new VoiceProviderConfiguration
+ {
+ ProviderId = VoiceProviderIds.ElevenLabs,
+ Values = new Dictionary<string, string>
+ {
+ [VoiceProviderSettingKeys.ApiKey] = "eleven-key",
+ [VoiceProviderSettingKeys.Model] = "eleven_multilingual_v2",
+ [VoiceProviderSettingKeys.VoiceId] = "voice-42"
+ }
+ }
+ ]
},
UserRules = new List<UserNotificationRule>
{
@@ -107,13 +125,13 @@ public void RoundTrip_AllFields_Preserved()
Assert.Equal(0.72f, restored.Voice.WakeWord.TriggerThreshold);
Assert.Equal(300, restored.Voice.AlwaysOn.MinSpeechMs);
Assert.Equal(VoiceChatWindowSubmitMode.WaitForUser, restored.Voice.AlwaysOn.ChatWindowSubmitMode);
- Assert.NotNull(restored.VoiceProviderCredentials);
- Assert.Equal("minimax-key", restored.VoiceProviderCredentials.MiniMaxApiKey);
- Assert.Equal("speech-2.8-turbo", restored.VoiceProviderCredentials.MiniMaxModel);
- Assert.Equal("English_MatureBoss", restored.VoiceProviderCredentials.MiniMaxVoiceId);
- Assert.Equal("eleven-key", restored.VoiceProviderCredentials.ElevenLabsApiKey);
- Assert.Equal("eleven-v3", restored.VoiceProviderCredentials.ElevenLabsModel);
- Assert.Equal("voice-42", restored.VoiceProviderCredentials.ElevenLabsVoiceId);
+ Assert.NotNull(restored.VoiceProviderConfiguration);
+ Assert.Equal("minimax-key", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.ApiKey));
+ Assert.Equal("speech-2.8-turbo", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.Model));
+ Assert.Equal("English_MatureBoss", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.VoiceId));
+ Assert.Equal("eleven-key", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.ApiKey));
+ Assert.Equal("eleven_multilingual_v2", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.Model));
+ Assert.Equal("voice-42", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.VoiceId));
Assert.NotNull(restored.UserRules);
Assert.Single(restored.UserRules);
Assert.Equal("build.*fail", restored.UserRules[0].Pattern);
@@ -165,18 +183,35 @@ public void MissingFields_UseDefaults()
Assert.False(settings.Voice.ShowConversationToasts);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.SpeechToTextProviderId);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.TextToSpeechProviderId);
- Assert.NotNull(settings.VoiceProviderCredentials);
- Assert.Null(settings.VoiceProviderCredentials.MiniMaxApiKey);
- Assert.Equal("speech-2.8-turbo", settings.VoiceProviderCredentials.MiniMaxModel);
- Assert.Equal("English_MatureBoss", settings.VoiceProviderCredentials.MiniMaxVoiceId);
- Assert.Null(settings.VoiceProviderCredentials.ElevenLabsApiKey);
- Assert.Null(settings.VoiceProviderCredentials.ElevenLabsModel);
- Assert.Null(settings.VoiceProviderCredentials.ElevenLabsVoiceId);
+ Assert.NotNull(settings.VoiceProviderConfiguration);
+ Assert.Empty(settings.VoiceProviderConfiguration.Providers);
Assert.Equal(16000, settings.Voice.SampleRateHz);
Assert.Equal("NanoWakeWord", settings.Voice.WakeWord.Engine);
Assert.Null(settings.UserRules);
}
+ [Fact]
+ public void LegacyVoiceProviderCredentials_Deserialize_ForMigration()
+ {
+ var json = """
+ {
+ "VoiceProviderCredentials": {
+ "MiniMaxApiKey": "minimax-key",
+ "MiniMaxModel": "speech-2.8-turbo",
+ "MiniMaxVoiceId": "English_MatureBoss"
+ }
+ }
+ """;
+
+ var settings = SettingsData.FromJson(json);
+
+ Assert.NotNull(settings);
+ Assert.NotNull(settings.VoiceProviderCredentials);
+ Assert.Equal("minimax-key", settings.VoiceProviderCredentials.MiniMaxApiKey);
+ Assert.Equal("speech-2.8-turbo", settings.VoiceProviderCredentials.MiniMaxModel);
+ Assert.Equal("English_MatureBoss", settings.VoiceProviderCredentials.MiniMaxVoiceId);
+ }
+
[Fact]
public void BackwardCompatibility_OldSettingsWithoutNewFields()
{
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index 159c06a..99fae22 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -1,5 +1,6 @@
using OpenClaw.Shared;
using OpenClawTray.Services.Voice;
+using System.Linq;
namespace OpenClaw.Tray.Tests;
@@ -20,6 +21,29 @@ public void SupportsTextToSpeechRuntime_ReturnsTrueForMiniMaxOnlyWhenImplemented
{
Assert.True(VoiceProviderCatalogService.SupportsTextToSpeechRuntime(VoiceProviderIds.Windows));
Assert.True(VoiceProviderCatalogService.SupportsTextToSpeechRuntime(VoiceProviderIds.MiniMax));
- Assert.False(VoiceProviderCatalogService.SupportsTextToSpeechRuntime(VoiceProviderIds.ElevenLabs));
+ Assert.True(VoiceProviderCatalogService.SupportsTextToSpeechRuntime(VoiceProviderIds.ElevenLabs));
+ }
+
+ [Fact]
+ public void LoadCatalog_ExposesBuiltInCloudTtsContracts()
+ {
+ var catalog = VoiceProviderCatalogService.LoadCatalog();
+
+ var minimax = Assert.Single(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.MiniMax);
+ Assert.NotNull(minimax.TextToSpeechHttp);
+ Assert.Equal("https://api.minimax.io/v1/t2a_v2", minimax.TextToSpeechHttp!.EndpointTemplate);
+ Assert.Equal("Authorization", minimax.TextToSpeechHttp.AuthenticationHeaderName);
+ Assert.Equal(VoiceTextToSpeechResponseModes.HexJsonString, minimax.TextToSpeechHttp.ResponseAudioMode);
+ Assert.Equal("speech-2.8-turbo", minimax.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Model).DefaultValue);
+ Assert.Equal("English_MatureBoss", minimax.Settings.Single(s => s.Key == VoiceProviderSettingKeys.VoiceId).DefaultValue);
+
+ var elevenLabs = Assert.Single(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.ElevenLabs);
+ Assert.NotNull(elevenLabs.TextToSpeechHttp);
+ Assert.Equal(
+ "https://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}?output_format=mp3_44100_128",
+ elevenLabs.TextToSpeechHttp!.EndpointTemplate);
+ Assert.Equal("xi-api-key", elevenLabs.TextToSpeechHttp.AuthenticationHeaderName);
+ Assert.Equal(VoiceTextToSpeechResponseModes.Binary, elevenLabs.TextToSpeechHttp.ResponseAudioMode);
+ Assert.Equal("eleven_multilingual_v2", elevenLabs.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Model).DefaultValue);
}
}
From 199e534dd3cde785343d4d685e70153e371949fd Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 13:38:54 +0000
Subject: [PATCH 18/83] Rename voice modes to VoiceWake and TalkMode
---
docs/VOICE-MODE.md | 76 +++++-----
src/OpenClaw.Shared/SettingsData.cs | 19 ++-
src/OpenClaw.Shared/VoiceModeSchema.cs | 94 ++++++++++--
src/OpenClaw.Tray.WinUI/App.xaml.cs | 2 +-
.../Controls/VoiceSettingsPanel.xaml | 4 +-
.../Controls/VoiceSettingsPanel.xaml.cs | 32 ++--
.../Services/Voice/VoiceDisplayHelper.cs | 6 +-
.../Services/Voice/VoiceService.cs | 140 +++++++++---------
.../Windows/VoiceModeWindow.xaml.cs | 4 +-
.../OpenClaw.Shared.Tests/CapabilityTests.cs | 16 +-
.../VoiceModeSchemaTests.cs | 16 +-
.../SettingsRoundTripTests.cs | 20 +--
12 files changed, 258 insertions(+), 171 deletions(-)
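
Note on the settings migration in this patch: `MigrateLegacyVoiceJson` rewrites legacy names with plain string replacement over the whole settings file, so a property called `"WakeWord":` or `"AlwaysOn":` anywhere in the document is renamed, not only the ones under `Voice`. A narrower alternative could look roughly like the sketch below. It is not part of this patch. `LegacyVoiceJsonSketch` and its `Rename` helper are hypothetical names, and the sketch assumes the legacy keys live only inside the top-level `Voice` object. It covers only the two settings keys; the legacy enum strings (`WakeWord`, `AlwaysOn`, `ListeningForWakeWord`) are already accepted on read by the converters added in `VoiceModeSchema.cs`.

```csharp
using System;
using System.Text.Json.Nodes;

// Hypothetical helper, not part of this patch: renames legacy keys only inside
// the top-level "Voice" object instead of replacing text across the whole file.
public static class LegacyVoiceJsonSketch
{
    public static string Migrate(string json)
    {
        // JsonNode.Parse throws JsonException on malformed input; a caller such as
        // SettingsData.FromJson would need its try/catch to cover this call.
        if (JsonNode.Parse(json) is not JsonObject root)
        {
            return json;
        }

        if (root["Voice"] is JsonObject voice)
        {
            Rename(voice, "WakeWord", "VoiceWake");
            Rename(voice, "AlwaysOn", "TalkMode");
        }

        return root.ToJsonString();
    }

    private static void Rename(JsonObject obj, string oldName, string newName)
    {
        // Leave the object alone when the new key already exists or the old one is absent.
        if (obj.ContainsKey(newName) || !obj.TryGetPropertyValue(oldName, out JsonNode? value))
        {
            return;
        }

        obj.Remove(oldName);   // detach the child so it can be re-parented
        obj[newName] = value;  // the renamed key moves to the end of the object
    }
}
```

With this shape, a user rule or provider value that happens to contain the text `"AlwaysOn":` is left untouched, at the cost of parsing the document twice (once here, once in `JsonSerializer.Deserialize`).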
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 72f4526..e7c7154 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -4,7 +4,7 @@ This document defines the voice subsystem for the Windows node only. It introduc
## Goals
-- Add a node-local voice mode with two activation modes: `wakeword` and `alwaysOn`
+- Add a node-local voice mode with two activation modes: `VoiceWake` and `TalkMode`
- Utilise minimal touch points to the existing app to reduce the potential for screw-ups.
- Use NanoWakeWord for wakeword detection on-device
- Present the user-facing mode names as `Voice Wake` and `Talk Mode`
@@ -39,14 +39,14 @@ The tray app now uses user-facing names rather than exposing the internal enum n
| Internal Mode | Visible Name | Availability |
|---|---|---|
| `Off` | Off | available |
-| `WakeWord` | Voice Wake | visible but disabled for now |
-| `AlwaysOn` | Talk Mode | available |
+| `VoiceWake` | Voice Wake | visible but disabled for now |
+| `TalkMode` | Talk Mode | available |
-Internally the contracts and persisted settings still use `WakeWord` and `AlwaysOn`.
+The contracts and persisted settings now use `VoiceWake` and `TalkMode` as well.
## Transport Boundary
-`AlwaysOn` follows a talk-mode style control flow:
+`TalkMode` follows the current talk-mode style control flow:
- the node captures audio locally
- local speech recognition turns that audio into transcript text
@@ -57,7 +57,7 @@ Internally the contracts and persisted settings still use `WakeWord` and `Always
That means the first Windows target is transcript transport, not raw audio upload. Streaming audio frames in or out of OpenClaw remains a future protocol extension, not part of this design.
-The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection exists to carry assistant chat events for `AlwaysOn`, and to provide a fallback direct `chat.send` path when the tray chat window is not open.
+The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection exists to carry assistant chat events for `TalkMode`, and to provide a fallback direct `chat.send` path when the tray chat window is not open.
## Tray Chat Integration Decision
@@ -69,7 +69,7 @@ Voice mode and typed chat must remain part of the same user-visible conversation
### Problem Encountered
-When `AlwaysOn` sends transcript text to the main OpenClaw session, the upstream session can include scaffolding such as `<relevant-memories>...</relevant-memories>` in the rendered user message body shown in the tray chat window.
+When `TalkMode` sends transcript text to the main OpenClaw session, the upstream session can include scaffolding such as `<relevant-memories>...</relevant-memories>` in the rendered user message body shown in the tray chat window.
That produced two UX problems:
@@ -104,7 +104,7 @@ The embedded [WebChatWindow.xaml.cs](../src/OpenClaw.Tray.WinUI/Windows/WebChatW
- interim STT hypotheses from Windows speech recognition are injected into the tray chat compose box while the user is speaking
- if the chat window opens during an utterance, the current buffered transcript is copied into the compose box immediately
- if the chat window closes during an utterance, voice continues windowless and the final utterance still submits
-- if the chat window is open and ready when the utterance finalizes, the tray app either auto-submits through the page's own send path or leaves the draft for manual send, depending on `Voice.AlwaysOn.ChatWindowSubmitMode`
+- if the chat window is open and ready when the utterance finalizes, the tray app either auto-submits through the page's own send path or leaves the draft for manual send, depending on `Voice.TalkMode.ChatWindowSubmitMode`
- in `WaitForUser` mode, voice capture pauses after finalizing the draft so the next utterance does not overwrite the unsent message
- if the chat window is not open or not ready, the voice service falls back to direct `chat.send`
- rendered chat content inside the tray window is still sanitized to remove `<relevant-memories>...</relevant-memories>` blocks as a fallback for messages that were sent while windowless
@@ -245,7 +245,7 @@ When the selected TTS provider in the Voice Mode window is not `windows`, the tr
- model
- voice id
-For `WakeWord`, trigger words are gateway-owned global state. The Windows node should eventually consume the same shared trigger list and keep only a local enabled/disabled toggle plus device/runtime settings.
+For `VoiceWake`, trigger words are gateway-owned global state. The Windows node should eventually consume the same shared trigger list and keep only a local enabled/disabled toggle plus device/runtime settings.
## Command Surface
@@ -265,8 +265,8 @@ The voice subsystem is introduced as a new node capability category: `voice`.
### Payload Types
- `VoiceSettings`
-- `VoiceWakeWordSettings`
-- `VoiceAlwaysOnSettings`
+- `VoiceWakeSettings`
+- `TalkModeSettings`
- `VoiceAudioDeviceInfo`
- `VoiceStatusInfo`
- `VoiceStartArgs`
@@ -288,7 +288,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
```json
{
"Voice": {
- "Mode": "WakeWord",
+ "Mode": "VoiceWake",
"Enabled": true,
"SpeechToTextProviderId": "windows",
"TextToSpeechProviderId": "windows",
@@ -297,7 +297,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
"SampleRateHz": 16000,
"CaptureChunkMs": 80,
"BargeInEnabled": true,
- "WakeWord": {
+ "VoiceWake": {
"Engine": "NanoWakeWord",
"ModelId": "hey_openclaw",
"TriggerThreshold": 0.65,
@@ -305,7 +305,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
"PreRollMs": 1200,
"EndSilenceMs": 900
},
- "AlwaysOn": {
+ "TalkMode": {
"MinSpeechMs": 250,
"EndSilenceMs": 900,
"MaxUtteranceMs": 15000,
@@ -339,7 +339,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
| Field | Purpose |
|---|---|
-| `Mode` | Top-level activation mode: `Off`, `WakeWord`, `AlwaysOn` |
+| `Mode` | Top-level activation mode: `Off`, `VoiceWake`, `TalkMode` |
| `Enabled` | Global feature kill-switch independent of mode |
| `SpeechToTextProviderId` | Selected STT provider id from the local provider catalog |
| `TextToSpeechProviderId` | Selected TTS provider id from the local provider catalog |
@@ -347,14 +347,14 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
| `SampleRateHz` | Shared capture sample rate, fixed to a speech-friendly default |
| `CaptureChunkMs` | Frame size for capture, VAD, and wakeword processing |
| `BargeInEnabled` | Allows microphone capture while audio playback is active |
-| `WakeWord.*` | NanoWakeWord and post-trigger utterance capture tuning |
-| `AlwaysOn.*` | Continuous-listening segmentation tuning |
+| `VoiceWake.*` | NanoWakeWord and post-trigger utterance capture tuning |
+| `TalkMode.*` | Continuous-listening segmentation tuning |
### Complete Settings Definition
| Setting | Type | Default | Applies To | Meaning |
|---|---|---|---|---|
-| `Voice.Mode` | enum | `Off` | all | Activation mode: `Off`, `WakeWord`, `AlwaysOn` |
+| `Voice.Mode` | enum | `Off` | all | Activation mode: `Off`, `VoiceWake`, `TalkMode` |
| `Voice.Enabled` | bool | `false` | all | Master enable/disable flag for voice mode |
| `Voice.SpeechToTextProviderId` | string | `windows` | all | Preferred speech-to-text provider id |
| `Voice.TextToSpeechProviderId` | string | `windows` | all | Preferred text-to-speech provider id |
@@ -363,22 +363,22 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
| `Voice.SampleRateHz` | int | `16000` | all | Internal capture rate used for wakeword, VAD, and utterance assembly |
| `Voice.CaptureChunkMs` | int | `80` | all | Audio frame duration used by the capture loop |
| `Voice.BargeInEnabled` | bool | `true` | all | If `true`, microphone capture may continue while response audio is playing |
-| `Voice.WakeWord.Engine` | string | `NanoWakeWord` | wakeword | Wakeword engine identifier |
-| `Voice.WakeWord.ModelId` | string | `hey_openclaw` | wakeword | Wakeword model/profile identifier |
-| `Voice.WakeWord.TriggerThreshold` | float | `0.65` | wakeword | Minimum score required to trigger wakeword activation |
-| `Voice.WakeWord.TriggerCooldownMs` | int | `2000` | wakeword | Minimum delay before another wakeword trigger is accepted |
-| `Voice.WakeWord.PreRollMs` | int | `1200` | wakeword | Buffered audio retained before the trigger point |
-| `Voice.WakeWord.EndSilenceMs` | int | `900` | wakeword | Silence timeout used to finalize the post-trigger utterance |
-| `Voice.AlwaysOn.MinSpeechMs` | int | `250` | always-on | Minimum detected speech duration before an utterance is treated as real input |
-| `Voice.AlwaysOn.EndSilenceMs` | int | `900` | always-on | Silence timeout used to finalize an utterance |
-| `Voice.AlwaysOn.MaxUtteranceMs` | int | `15000` | always-on | Hard cap on utterance length before forced submission/finalization |
-| `Voice.AlwaysOn.ChatWindowSubmitMode` | enum | `AutoSend` | always-on | When the tray chat window is open, either auto-send the finalized utterance or leave it in the compose box for manual send |
+| `Voice.VoiceWake.Engine` | string | `NanoWakeWord` | voice wake | Voice Wake engine identifier |
+| `Voice.VoiceWake.ModelId` | string | `hey_openclaw` | voice wake | Voice Wake model/profile identifier |
+| `Voice.VoiceWake.TriggerThreshold` | float | `0.65` | voice wake | Minimum score required to trigger Voice Wake activation |
+| `Voice.VoiceWake.TriggerCooldownMs` | int | `2000` | voice wake | Minimum delay before another Voice Wake trigger is accepted |
+| `Voice.VoiceWake.PreRollMs` | int | `1200` | voice wake | Buffered audio retained before the trigger point |
+| `Voice.VoiceWake.EndSilenceMs` | int | `900` | voice wake | Silence timeout used to finalize the post-trigger utterance |
+| `Voice.TalkMode.MinSpeechMs` | int | `250` | talk mode | Minimum detected speech duration before an utterance is treated as real input |
+| `Voice.TalkMode.EndSilenceMs` | int | `900` | talk mode | Silence timeout used to finalize an utterance |
+| `Voice.TalkMode.MaxUtteranceMs` | int | `15000` | talk mode | Hard cap on utterance length before forced submission/finalization |
+| `Voice.TalkMode.ChatWindowSubmitMode` | enum | `AutoSend` | talk mode | When the tray chat window is open, either auto-send the finalized utterance or leave it in the compose box for manual send |
| `VoiceProviderConfiguration.Providers[].ProviderId` | string | none | cloud providers | Provider id matching a `voice-providers.json` entry |
| `VoiceProviderConfiguration.Providers[].Values["apiKey"]` | string? | `null` | cloud providers | API key sent using the provider contract's configured auth header |
| `VoiceProviderConfiguration.Providers[].Values["model"]` | string? | provider default | cloud providers | Model identifier inserted into the configured request template |
| `VoiceProviderConfiguration.Providers[].Values["voiceId"]` | string? | provider default | cloud providers | Voice id inserted into the configured request template or URL |
-At runtime today, those device ids are persisted and surfaced in the UI, but the v1 `AlwaysOn` path still uses the Windows system speech stack defaults for capture and playback.
+At runtime today, those device ids are persisted and surfaced in the UI, but the v1 `TalkMode` path still uses the Windows system speech stack defaults for capture and playback.
## Component Architecture
@@ -387,7 +387,7 @@ flowchart LR
A["NodeService<br/>control + lifecycle"] --> B["VoiceCapability<br/>command surface"]
B --> C["VoiceCoordinator<br/>runtime state machine"]
C --> D["SpeechRecognizer<br/>Windows continuous dictation"]
- C --> E["WakeWordService<br/>NanoWakeWord scores"]
+ C --> E["VoiceWakeService<br/>NanoWakeWord scores"]
C --> F["VoiceActivityDetector<br/>speech/silence segments"]
C --> G["VoiceTransport<br/>operator sidecar + chat.send exchange"]
C --> H["SpeechSynthesizer + MediaPlayer<br/>reply playback"]
@@ -396,16 +396,16 @@ flowchart LR
## Runtime Data Flow
-### Wakeword Mode
+### Voice Wake Mode
```mermaid
flowchart TD
A["Microphone device<br/>float/PCM hardware frames"] --> B["AudioCaptureService<br/>PCM16 mono 16kHz chunks"]
B --> C["Ring Buffer<br/>bounded pre-roll PCM16 frames"]
- B --> D["WakeWordService (NanoWakeWord)<br/>wake score per chunk"]
+B --> D["VoiceWakeService (NanoWakeWord)<br/>wake score per chunk"]
D --> E{"score >= trigger threshold?"}
E -- "no" --> B
- E -- "yes" --> F["VoiceCoordinator<br/>WakeWordDetected(session state change)"]
+E -- "yes" --> F["VoiceCoordinator<br/>VoiceWakeDetected(session state change)"]
F --> G["UtteranceAssembler<br/>seed with pre-roll PCM16 from Ring Buffer"]
C --> G
B --> G
@@ -445,7 +445,7 @@ flowchart TD
| Stage | Component | Input | Output |
|---|---|---|---|
| 1 | `SpeechRecognizer` | Windows microphone capture | recognized transcript text |
-| 2a | `WakeWordService` | PCM16 chunk | wake score / trigger decision |
+| 2a | `VoiceWakeService` | PCM16 chunk | wake score / trigger decision |
| 2b | `VoiceActivityDetector` | PCM16 chunk | speech/silence state |
| 3 | `Ring Buffer` | PCM16 chunk stream | bounded pre-roll PCM16 window |
| 4 | `UtteranceAssembler` | pre-roll + live PCM16 chunks | utterance PCM16 buffer |
@@ -469,9 +469,9 @@ sequenceDiagram
VoiceCap->>Store: save VoiceSettings
VoiceCap-->>Gateway: VoiceSettings
- Gateway->>VoiceCap: voice.start(mode=WakeWord, sessionKey=...)
+Gateway->>VoiceCap: voice.start(mode=VoiceWake, sessionKey=...)
VoiceCap->>Coord: Start(VoiceStartArgs)
- Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningForWakeWord)
+Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningForVoiceWake)
VoiceCap-->>Gateway: VoiceStatusInfo
Gateway->>VoiceCap: voice.status.get
@@ -492,13 +492,13 @@ sequenceDiagram
- `WindowsNodeClient` remains the gateway/node transport
- existing node capability registration remains the integration pattern
- current request/response transport remains the v1 control plane
-- `AlwaysOn` should reuse existing `chat.send` message flow instead of inventing an audio-upload protocol
+- `TalkMode` should reuse existing `chat.send` message flow instead of inventing an audio-upload protocol
### New Components Expected Later
- `VoiceCapability` in `OpenClaw.Shared.Capabilities`
- `AudioCaptureService` in `OpenClaw.Tray.WinUI.Services`
-- `WakeWordService` in `OpenClaw.Tray.WinUI.Services`
+- `VoiceWakeService` in `OpenClaw.Tray.WinUI.Services`
- `VoiceCoordinator` in `OpenClaw.Tray.WinUI.Services`
- `AudioPlaybackService` in `OpenClaw.Tray.WinUI.Services`
diff --git a/src/OpenClaw.Shared/SettingsData.cs b/src/OpenClaw.Shared/SettingsData.cs
index d2a2c6d..6bcfcd9 100644
--- a/src/OpenClaw.Shared/SettingsData.cs
+++ b/src/OpenClaw.Shared/SettingsData.cs
@@ -1,3 +1,4 @@
+using System;
using System.Text.Json.Serialization;
using System.Text.Json;
@@ -40,11 +41,27 @@ public class SettingsData
{
try
{
- return JsonSerializer.Deserialize<SettingsData>(json);
+ return JsonSerializer.Deserialize<SettingsData>(MigrateLegacyVoiceJson(json));
}
catch
{
return null;
}
}
+
+ private static string MigrateLegacyVoiceJson(string json)
+ {
+ return json
+ .Replace("\"WakeWord\":", "\"VoiceWake\":", StringComparison.Ordinal)
+ .Replace("\"AlwaysOn\":", "\"TalkMode\":", StringComparison.Ordinal)
+ .Replace("\"WakeWordModelId\":", "\"VoiceWakeModelId\":", StringComparison.Ordinal)
+ .Replace("\"WakeWordLoaded\":", "\"VoiceWakeLoaded\":", StringComparison.Ordinal)
+ .Replace("\"LastWakeWordUtc\":", "\"LastVoiceWakeUtc\":", StringComparison.Ordinal)
+ .Replace("\"Mode\":\"WakeWord\"", "\"Mode\":\"VoiceWake\"", StringComparison.Ordinal)
+ .Replace("\"Mode\": \"WakeWord\"", "\"Mode\": \"VoiceWake\"", StringComparison.Ordinal)
+ .Replace("\"Mode\":\"AlwaysOn\"", "\"Mode\":\"TalkMode\"", StringComparison.Ordinal)
+ .Replace("\"Mode\": \"AlwaysOn\"", "\"Mode\": \"TalkMode\"", StringComparison.Ordinal)
+ .Replace("\"State\":\"ListeningForWakeWord\"", "\"State\":\"ListeningForVoiceWake\"", StringComparison.Ordinal)
+ .Replace("\"State\": \"ListeningForWakeWord\"", "\"State\": \"ListeningForVoiceWake\"", StringComparison.Ordinal);
+ }
}
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index 2f99bec..d3d6d42 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -1,4 +1,6 @@
+using System;
using System.Collections.ObjectModel;
+using System.Text.Json;
using System.Text.Json.Serialization;
namespace OpenClaw.Shared;
@@ -25,22 +27,22 @@ public static class VoiceCommands
public static IReadOnlyList<string> All => s_all;
}
-[JsonConverter(typeof(JsonStringEnumConverter<VoiceActivationMode>))]
+[JsonConverter(typeof(VoiceActivationModeJsonConverter))]
public enum VoiceActivationMode
{
Off,
- WakeWord,
- AlwaysOn
+ VoiceWake,
+ TalkMode
}
-[JsonConverter(typeof(JsonStringEnumConverter<VoiceRuntimeState>))]
+[JsonConverter(typeof(VoiceRuntimeStateJsonConverter))]
public enum VoiceRuntimeState
{
Stopped,
Paused,
Idle,
Arming,
- ListeningForWakeWord,
+ ListeningForVoiceWake,
ListeningContinuously,
RecordingUtterance,
SubmittingAudio,
@@ -69,11 +71,11 @@ public sealed class VoiceSettings
public int SampleRateHz { get; set; } = 16000;
public int CaptureChunkMs { get; set; } = 80;
public bool BargeInEnabled { get; set; } = true;
- public VoiceWakeWordSettings WakeWord { get; set; } = new();
- public VoiceAlwaysOnSettings AlwaysOn { get; set; } = new();
+ public VoiceWakeSettings VoiceWake { get; set; } = new();
+ public TalkModeSettings TalkMode { get; set; } = new();
}
-public sealed class VoiceWakeWordSettings
+public sealed class VoiceWakeSettings
{
public string Engine { get; set; } = "NanoWakeWord";
public string ModelId { get; set; } = "hey_openclaw";
@@ -83,7 +85,7 @@ public sealed class VoiceWakeWordSettings
public int EndSilenceMs { get; set; } = 900;
}
-public sealed class VoiceAlwaysOnSettings
+public sealed class TalkModeSettings
{
public int MinSpeechMs { get; set; } = 250;
public int EndSilenceMs { get; set; } = 900;
@@ -109,9 +111,9 @@ public sealed class VoiceStatusInfo
public string? SessionKey { get; set; }
public string? InputDeviceId { get; set; }
public string? OutputDeviceId { get; set; }
- public string? WakeWordModelId { get; set; }
- public bool WakeWordLoaded { get; set; }
- public DateTime? LastWakeWordUtc { get; set; }
+ public string? VoiceWakeModelId { get; set; }
+ public bool VoiceWakeLoaded { get; set; }
+ public DateTime? LastVoiceWakeUtc { get; set; }
public DateTime? LastUtteranceUtc { get; set; }
public string? LastError { get; set; }
}
@@ -218,3 +220,71 @@ public sealed class VoiceProviderCatalog
public List<VoiceProviderOption> SpeechToTextProviders { get; set; } = [];
public List<VoiceProviderOption> TextToSpeechProviders { get; set; } = [];
}
+
+public sealed class VoiceActivationModeJsonConverter : JsonConverter<VoiceActivationMode>
+{
+ public override VoiceActivationMode Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
+ {
+ var value = reader.GetString();
+ return value switch
+ {
+ "VoiceWake" or "WakeWord" => VoiceActivationMode.VoiceWake,
+ "TalkMode" or "AlwaysOn" => VoiceActivationMode.TalkMode,
+ _ => VoiceActivationMode.Off
+ };
+ }
+
+ public override void Write(Utf8JsonWriter writer, VoiceActivationMode value, JsonSerializerOptions options)
+ {
+ writer.WriteStringValue(value switch
+ {
+ VoiceActivationMode.VoiceWake => "VoiceWake",
+ VoiceActivationMode.TalkMode => "TalkMode",
+ _ => "Off"
+ });
+ }
+}
+
+public sealed class VoiceRuntimeStateJsonConverter : JsonConverter<VoiceRuntimeState>
+{
+ public override VoiceRuntimeState Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
+ {
+ var value = reader.GetString();
+ return value switch
+ {
+ "ListeningForVoiceWake" or "ListeningForWakeWord" => VoiceRuntimeState.ListeningForVoiceWake,
+ "Stopped" => VoiceRuntimeState.Stopped,
+ "Paused" => VoiceRuntimeState.Paused,
+ "Idle" => VoiceRuntimeState.Idle,
+ "Arming" => VoiceRuntimeState.Arming,
+ "ListeningContinuously" => VoiceRuntimeState.ListeningContinuously,
+ "RecordingUtterance" => VoiceRuntimeState.RecordingUtterance,
+ "SubmittingAudio" => VoiceRuntimeState.SubmittingAudio,
+ "PendingManualSend" => VoiceRuntimeState.PendingManualSend,
+ "AwaitingResponse" => VoiceRuntimeState.AwaitingResponse,
+ "PlayingResponse" => VoiceRuntimeState.PlayingResponse,
+ "Error" => VoiceRuntimeState.Error,
+ _ => VoiceRuntimeState.Stopped
+ };
+ }
+
+ public override void Write(Utf8JsonWriter writer, VoiceRuntimeState value, JsonSerializerOptions options)
+ {
+ writer.WriteStringValue(value switch
+ {
+ VoiceRuntimeState.ListeningForVoiceWake => "ListeningForVoiceWake",
+ VoiceRuntimeState.Stopped => "Stopped",
+ VoiceRuntimeState.Paused => "Paused",
+ VoiceRuntimeState.Idle => "Idle",
+ VoiceRuntimeState.Arming => "Arming",
+ VoiceRuntimeState.ListeningContinuously => "ListeningContinuously",
+ VoiceRuntimeState.RecordingUtterance => "RecordingUtterance",
+ VoiceRuntimeState.SubmittingAudio => "SubmittingAudio",
+ VoiceRuntimeState.PendingManualSend => "PendingManualSend",
+ VoiceRuntimeState.AwaitingResponse => "AwaitingResponse",
+ VoiceRuntimeState.PlayingResponse => "PlayingResponse",
+ VoiceRuntimeState.Error => "Error",
+ _ => "Stopped"
+ });
+ }
+}
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index 9cb8914..b6b909f 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -257,7 +257,7 @@ protected override async void OnLaunched(LaunchActivatedEventArgs args)
_voiceService = new VoiceService(new AppLogger(), _settings);
_voiceChatCoordinator = new VoiceChatCoordinator(
_voiceService,
- () => _settings.Voice.AlwaysOn.ChatWindowSubmitMode,
+ () => _settings.Voice.TalkMode.ChatWindowSubmitMode,
new DispatcherQueueAdapter(_dispatcherQueue!));
_voiceChatCoordinator.ConversationTurnAvailable += OnVoiceConversationTurnAvailable;
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
index 181a816..9cd3c05 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
@@ -10,8 +10,8 @@
<ComboBox x:Name="VoiceModeComboBox" Header="Mode" SelectionChanged="OnVoiceModeChanged">
<ComboBoxItem Content="Off" Tag="Off"/>
- <ComboBoxItem Content="Voice Wake" Tag="WakeWord" IsEnabled="False"/>
- <ComboBoxItem Content="Talk Mode" Tag="AlwaysOn"/>
+ <ComboBoxItem Content="Voice Wake" Tag="VoiceWake" IsEnabled="False"/>
+ <ComboBoxItem Content="Talk Mode" Tag="TalkMode"/>
</ComboBox>
<ComboBox x:Name="VoiceSpeechToTextProviderComboBox"
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
index 968b00d..50f3eb9 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -52,20 +52,20 @@ public void ApplyTo(SettingsManager settings)
SampleRateHz = settings.Voice.SampleRateHz,
CaptureChunkMs = settings.Voice.CaptureChunkMs,
BargeInEnabled = settings.Voice.BargeInEnabled,
- WakeWord = new VoiceWakeWordSettings
+ VoiceWake = new VoiceWakeSettings
{
- Engine = settings.Voice.WakeWord.Engine,
- ModelId = settings.Voice.WakeWord.ModelId,
- TriggerThreshold = settings.Voice.WakeWord.TriggerThreshold,
- TriggerCooldownMs = settings.Voice.WakeWord.TriggerCooldownMs,
- PreRollMs = settings.Voice.WakeWord.PreRollMs,
- EndSilenceMs = settings.Voice.WakeWord.EndSilenceMs
+ Engine = settings.Voice.VoiceWake.Engine,
+ ModelId = settings.Voice.VoiceWake.ModelId,
+ TriggerThreshold = settings.Voice.VoiceWake.TriggerThreshold,
+ TriggerCooldownMs = settings.Voice.VoiceWake.TriggerCooldownMs,
+ PreRollMs = settings.Voice.VoiceWake.PreRollMs,
+ EndSilenceMs = settings.Voice.VoiceWake.EndSilenceMs
},
- AlwaysOn = new VoiceAlwaysOnSettings
+ TalkMode = new TalkModeSettings
{
- MinSpeechMs = settings.Voice.AlwaysOn.MinSpeechMs,
- EndSilenceMs = settings.Voice.AlwaysOn.EndSilenceMs,
- MaxUtteranceMs = settings.Voice.AlwaysOn.MaxUtteranceMs,
+ MinSpeechMs = settings.Voice.TalkMode.MinSpeechMs,
+ EndSilenceMs = settings.Voice.TalkMode.EndSilenceMs,
+ MaxUtteranceMs = settings.Voice.TalkMode.MaxUtteranceMs,
ChatWindowSubmitMode = GetSelectedChatWindowSubmitMode()
}
};
@@ -82,7 +82,7 @@ private void LoadVoiceSettings()
_voiceProviderConfigurationDraft = _settings.VoiceProviderConfiguration.Clone();
LoadVoiceProviders();
SelectVoiceMode(_settings.Voice.Mode);
- SelectChatWindowSubmitMode(_settings.Voice.AlwaysOn.ChatWindowSubmitMode);
+ SelectChatWindowSubmitMode(_settings.Voice.TalkMode.ChatWindowSubmitMode);
VoiceConversationToastsCheckBox.IsChecked = _settings.Voice.ShowConversationToasts;
UpdateVoiceProviderSettingsEditor();
UpdateVoiceSettingsInfo();
@@ -156,8 +156,8 @@ private void SelectVoiceMode(VoiceActivationMode mode)
{
var target = mode switch
{
- VoiceActivationMode.WakeWord => "WakeWord",
- VoiceActivationMode.AlwaysOn => "AlwaysOn",
+ VoiceActivationMode.VoiceWake => "VoiceWake",
+ VoiceActivationMode.TalkMode => "TalkMode",
_ => "Off"
};
@@ -178,8 +178,8 @@ private VoiceActivationMode GetSelectedVoiceMode()
var tag = (VoiceModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
return tag switch
{
- "WakeWord" => VoiceActivationMode.WakeWord,
- "AlwaysOn" => VoiceActivationMode.AlwaysOn,
+ "VoiceWake" => VoiceActivationMode.VoiceWake,
+ "TalkMode" => VoiceActivationMode.TalkMode,
_ => VoiceActivationMode.Off
};
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs
index 3921863..496c643 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs
@@ -8,8 +8,8 @@ public static string GetModeLabel(VoiceActivationMode mode)
{
return mode switch
{
- VoiceActivationMode.WakeWord => "Voice Wake",
- VoiceActivationMode.AlwaysOn => "Talk Mode",
+ VoiceActivationMode.VoiceWake => "Voice Wake",
+ VoiceActivationMode.TalkMode => "Talk Mode",
_ => "Off"
};
}
@@ -19,7 +19,7 @@ public static string GetStateLabel(VoiceRuntimeState state)
return state switch
{
VoiceRuntimeState.Arming => "Starting",
- VoiceRuntimeState.ListeningForWakeWord => "Listening",
+ VoiceRuntimeState.ListeningForVoiceWake => "Listening",
VoiceRuntimeState.ListeningContinuously => "Listening",
VoiceRuntimeState.RecordingUtterance => "Recording",
VoiceRuntimeState.SubmittingAudio => "Sending",
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 8822916..57ee68b 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -233,19 +233,19 @@ public async Task<VoiceStatusInfo> StartAsync(VoiceStartArgs args)
{
switch (requestedMode)
{
- case VoiceActivationMode.AlwaysOn:
- await StartAlwaysOnRuntimeAsync(effectiveSettings, sessionKey);
+ case VoiceActivationMode.TalkMode:
+ await StartTalkModeRuntimeAsync(effectiveSettings, sessionKey);
break;
- case VoiceActivationMode.WakeWord:
+ case VoiceActivationMode.VoiceWake:
lock (_gate)
{
_status = BuildRunningStatus(
- VoiceActivationMode.WakeWord,
+ VoiceActivationMode.VoiceWake,
sessionKey,
- VoiceRuntimeState.ListeningForWakeWord,
- "WakeWord capture is not implemented yet");
+ VoiceRuntimeState.ListeningForVoiceWake,
+ "Voice Wake capture is not implemented yet");
}
- _logger.Info("Voice runtime started in mode WakeWord");
+ _logger.Info("Voice runtime started in mode VoiceWake");
break;
default:
lock (_gate)
@@ -367,7 +367,7 @@ public void Dispose()
}
}
- private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? sessionKey)
+ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? sessionKey)
{
var effectiveSessionKey = string.IsNullOrWhiteSpace(sessionKey) ? DefaultSessionKey : sessionKey;
var selectedSpeechToText = VoiceProviderCatalogService.ResolveSpeechToTextProvider(
@@ -394,12 +394,12 @@ private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? ses
if (!string.IsNullOrWhiteSpace(settings.InputDeviceId))
{
- _logger.Warn("Selected input device is saved, but AlwaysOn currently uses the system speech input device.");
+ _logger.Warn("Selected input device is saved, but Talk Mode currently uses the system speech input device.");
}
if (!string.IsNullOrWhiteSpace(settings.OutputDeviceId))
{
- _logger.Warn("Selected output device is saved, but AlwaysOn currently uses the default speech output device.");
+ _logger.Warn("Selected output device is saved, but Talk Mode currently uses the default speech output device.");
}
recognizer.HypothesisGenerated += OnSpeechHypothesisGenerated;
@@ -413,7 +413,7 @@ private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? ses
_speechSynthesizer = synthesizer;
_mediaPlayer = player;
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
effectiveSessionKey,
VoiceRuntimeState.Arming,
fallbackMessage);
@@ -427,14 +427,14 @@ private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? ses
if (_status.Running)
{
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
effectiveSessionKey,
VoiceRuntimeState.ListeningContinuously,
fallbackMessage);
}
}
- _logger.Info("Voice runtime started in mode AlwaysOn");
+ _logger.Info("Voice runtime started in mode TalkMode");
}
catch
{
@@ -470,7 +470,7 @@ private async Task StartAlwaysOnRuntimeAsync(VoiceSettings settings, string? ses
private async Task<SpeechRecognizer> CreateSpeechRecognizerAsync(VoiceSettings settings)
{
var recognizer = new SpeechRecognizer();
- recognizer.Timeouts.EndSilenceTimeout = TimeSpan.FromMilliseconds(settings.AlwaysOn.EndSilenceMs);
+ recognizer.Timeouts.EndSilenceTimeout = TimeSpan.FromMilliseconds(settings.TalkMode.EndSilenceMs);
recognizer.Timeouts.InitialSilenceTimeout = TimeSpan.FromSeconds(10);
recognizer.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(4);
recognizer.Constraints.Add(new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.Dictation, "always-on-dictation"));
@@ -590,7 +590,7 @@ private async Task StartRecognitionSessionAsync()
if (_status.Running && !_awaitingReply && !_isSpeaking)
{
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.ListeningContinuously,
null);
@@ -651,7 +651,7 @@ private async void OnSpeechResultGenerated(
{
if (_status.Running)
{
- _status = BuildErrorStatus(VoiceActivationMode.AlwaysOn, _status.SessionKey, GetUserFacingErrorMessage(ex));
+ _status = BuildErrorStatus(VoiceActivationMode.TalkMode, _status.SessionKey, GetUserFacingErrorMessage(ex));
}
}
}
@@ -665,7 +665,7 @@ private void OnSpeechHypothesisGenerated(SpeechRecognizer sender, SpeechRecognit
lock (_gate)
{
if (_runtimeCts == null ||
- _status.Mode != VoiceActivationMode.AlwaysOn ||
+ _status.Mode != VoiceActivationMode.TalkMode ||
!_status.Running ||
_awaitingReply ||
_isSpeaking)
@@ -687,12 +687,12 @@ private void OnSpeechHypothesisGenerated(SpeechRecognizer sender, SpeechRecognit
private async Task HandleRecognizedTextAsync(string text)
{
- CancellationToken cancellationToken;
- string sessionKey;
+ CancellationToken cancellationToken;
+ string sessionKey;
- lock (_gate)
- {
- if (_runtimeCts == null || _status.Mode != VoiceActivationMode.AlwaysOn || !_status.Running)
+ lock (_gate)
+ {
+ if (_runtimeCts == null || _status.Mode != VoiceActivationMode.TalkMode || !_status.Running)
{
return;
}
@@ -712,7 +712,7 @@ private async Task HandleRecognizedTextAsync(string text)
_lastTranscriptUtc = DateTime.UtcNow;
cancellationToken = _runtimeCts.Token;
sessionKey = GetCurrentVoiceSessionKey();
- }
+ }
RaiseTranscriptDraft(text, sessionKey, clear: false);
@@ -755,7 +755,7 @@ private async Task HandleRecognizedTextAsync(string text)
_awaitingReply = false;
_pendingManualTranscript = text;
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.PendingManualSend,
"Draft ready in tray chat window. Send it manually to continue.");
@@ -771,7 +771,7 @@ private async Task HandleRecognizedTextAsync(string text)
_awaitingReply = true;
_pendingManualTranscript = null;
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.AwaitingResponse,
_status.LastError);
@@ -790,7 +790,7 @@ private async Task HandleRecognizedTextAsync(string text)
{
_awaitingReply = false;
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.ListeningContinuously,
GetUserFacingErrorMessage(ex));
@@ -807,7 +807,7 @@ public void NotifyManualTranscriptSubmitted(string text, string? sessionKey = nu
lock (_gate)
{
- if (_runtimeCts == null || _status.Mode != VoiceActivationMode.AlwaysOn || !_status.Running)
+ if (_runtimeCts == null || _status.Mode != VoiceActivationMode.TalkMode || !_status.Running)
{
return;
}
@@ -817,7 +817,7 @@ public void NotifyManualTranscriptSubmitted(string text, string? sessionKey = nu
_pendingManualTranscript = null;
_awaitingReply = true;
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.AwaitingResponse,
_status.LastError);
@@ -843,7 +843,7 @@ private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken
{
_awaitingReply = false;
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.ListeningContinuously,
"Timed out waiting for an assistant reply.");
@@ -876,7 +876,7 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
lock (_gate)
{
- if (!_awaitingReply || !_status.Running || _status.Mode != VoiceActivationMode.AlwaysOn)
+ if (!_awaitingReply || !_status.Running || _status.Mode != VoiceActivationMode.TalkMode)
{
return;
}
@@ -889,7 +889,7 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
_awaitingReply = false;
_isSpeaking = true;
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.PlayingResponse,
_status.LastError);
@@ -904,7 +904,7 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
if (_status.Running)
{
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.ListeningContinuously,
_status.LastError);
@@ -926,7 +926,7 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
lock (_gate)
{
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.ListeningContinuously,
GetUserFacingErrorMessage(ex));
@@ -940,7 +940,7 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
if (_status.Running)
{
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.ListeningContinuously,
_status.LastError);
@@ -1053,7 +1053,7 @@ private async void OnSpeechRecognitionCompleted(
_recognitionActive = false;
token = _runtimeCts.Token;
shouldRestart = _status.Running &&
- _status.Mode == VoiceActivationMode.AlwaysOn &&
+ _status.Mode == VoiceActivationMode.TalkMode &&
!_awaitingReply &&
!_isSpeaking;
}
@@ -1084,26 +1084,26 @@ private void OnChatTransportStatusChanged(object? sender, ConnectionStatus statu
_transportReadyTcs?.TrySetResult(true);
if (_status.Running &&
- _status.Mode == VoiceActivationMode.AlwaysOn &&
+ _status.Mode == VoiceActivationMode.TalkMode &&
!_awaitingReply &&
!_isSpeaking)
{
- _status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
- _status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
- _status.LastError);
- }
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ _status.LastError);
+ }
}
else if (status == ConnectionStatus.Error)
{
_transportReadyTcs?.TrySetException(
new InvalidOperationException("Voice chat transport failed to connect."));
- if (_status.Running && _status.Mode == VoiceActivationMode.AlwaysOn)
+ if (_status.Running && _status.Mode == VoiceActivationMode.TalkMode)
{
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.Arming,
"Voice chat transport failed.");
@@ -1111,10 +1111,10 @@ private void OnChatTransportStatusChanged(object? sender, ConnectionStatus statu
}
else if (status == ConnectionStatus.Disconnected)
{
- if (_status.Running && _status.Mode == VoiceActivationMode.AlwaysOn)
+ if (_status.Running && _status.Mode == VoiceActivationMode.TalkMode)
{
_status = BuildRunningStatus(
- VoiceActivationMode.AlwaysOn,
+ VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.Arming,
"Voice chat transport disconnected.");
@@ -1277,9 +1277,9 @@ private VoiceStatusInfo BuildRunningStatus(
SessionKey = sessionKey,
InputDeviceId = settings.InputDeviceId,
OutputDeviceId = settings.OutputDeviceId,
- WakeWordModelId = settings.WakeWord.ModelId,
- WakeWordLoaded = mode == VoiceActivationMode.WakeWord,
- LastWakeWordUtc = _status.LastWakeWordUtc,
+ VoiceWakeModelId = settings.VoiceWake.ModelId,
+ VoiceWakeLoaded = mode == VoiceActivationMode.VoiceWake,
+ LastVoiceWakeUtc = _status.LastVoiceWakeUtc,
LastUtteranceUtc = _status.LastUtteranceUtc,
LastError = lastError
};
@@ -1297,9 +1297,9 @@ private VoiceStatusInfo BuildStoppedStatus(string? sessionKey, string? reason)
SessionKey = sessionKey,
InputDeviceId = settings.InputDeviceId,
OutputDeviceId = settings.OutputDeviceId,
- WakeWordModelId = settings.WakeWord.ModelId,
- WakeWordLoaded = false,
- LastWakeWordUtc = _status.LastWakeWordUtc,
+ VoiceWakeModelId = settings.VoiceWake.ModelId,
+ VoiceWakeLoaded = false,
+ LastVoiceWakeUtc = _status.LastVoiceWakeUtc,
LastUtteranceUtc = _status.LastUtteranceUtc,
LastError = reason
};
@@ -1317,9 +1317,9 @@ private VoiceStatusInfo BuildPausedStatus(VoiceActivationMode mode, string? sess
SessionKey = sessionKey,
InputDeviceId = settings.InputDeviceId,
OutputDeviceId = settings.OutputDeviceId,
- WakeWordModelId = settings.WakeWord.ModelId,
- WakeWordLoaded = false,
- LastWakeWordUtc = _status.LastWakeWordUtc,
+ VoiceWakeModelId = settings.VoiceWake.ModelId,
+ VoiceWakeLoaded = false,
+ LastVoiceWakeUtc = _status.LastVoiceWakeUtc,
LastUtteranceUtc = _status.LastUtteranceUtc,
LastError = reason
};
@@ -1346,21 +1346,21 @@ private static VoiceSettings Clone(VoiceSettings source)
SampleRateHz = source.SampleRateHz,
CaptureChunkMs = source.CaptureChunkMs,
BargeInEnabled = source.BargeInEnabled,
- WakeWord = new VoiceWakeWordSettings
+ VoiceWake = new VoiceWakeSettings
{
- Engine = source.WakeWord.Engine,
- ModelId = source.WakeWord.ModelId,
- TriggerThreshold = source.WakeWord.TriggerThreshold,
- TriggerCooldownMs = source.WakeWord.TriggerCooldownMs,
- PreRollMs = source.WakeWord.PreRollMs,
- EndSilenceMs = source.WakeWord.EndSilenceMs
+ Engine = source.VoiceWake.Engine,
+ ModelId = source.VoiceWake.ModelId,
+ TriggerThreshold = source.VoiceWake.TriggerThreshold,
+ TriggerCooldownMs = source.VoiceWake.TriggerCooldownMs,
+ PreRollMs = source.VoiceWake.PreRollMs,
+ EndSilenceMs = source.VoiceWake.EndSilenceMs
},
- AlwaysOn = new VoiceAlwaysOnSettings
+ TalkMode = new TalkModeSettings
{
- MinSpeechMs = source.AlwaysOn.MinSpeechMs,
- EndSilenceMs = source.AlwaysOn.EndSilenceMs,
- MaxUtteranceMs = source.AlwaysOn.MaxUtteranceMs,
- ChatWindowSubmitMode = source.AlwaysOn.ChatWindowSubmitMode
+ MinSpeechMs = source.TalkMode.MinSpeechMs,
+ EndSilenceMs = source.TalkMode.EndSilenceMs,
+ MaxUtteranceMs = source.TalkMode.MaxUtteranceMs,
+ ChatWindowSubmitMode = source.TalkMode.ChatWindowSubmitMode
}
};
}
@@ -1376,9 +1376,9 @@ private static VoiceStatusInfo Clone(VoiceStatusInfo source)
SessionKey = source.SessionKey,
InputDeviceId = source.InputDeviceId,
OutputDeviceId = source.OutputDeviceId,
- WakeWordModelId = source.WakeWordModelId,
- WakeWordLoaded = source.WakeWordLoaded,
- LastWakeWordUtc = source.LastWakeWordUtc,
+ VoiceWakeModelId = source.VoiceWakeModelId,
+ VoiceWakeLoaded = source.VoiceWakeLoaded,
+ LastVoiceWakeUtc = source.LastVoiceWakeUtc,
LastUtteranceUtc = source.LastUtteranceUtc,
LastError = source.LastError
};
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
index 062f261..1029c90 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
@@ -56,14 +56,14 @@ public void RefreshStatus()
new("Text to speech", ResolveProviderName(catalog.TextToSpeechProviders, _settings.Voice.TextToSpeechProviderId, "Windows Speech Synthesis")),
new("Listen device", DescribeDevice(_settings.Voice.InputDeviceId, "System default microphone")),
new("Talk device", DescribeDevice(_settings.Voice.OutputDeviceId, "System default speaker")),
- new("Chat window", DescribeChatWindowSubmitMode(_settings.Voice.AlwaysOn.ChatWindowSubmitMode)),
+ new("Chat window", DescribeChatWindowSubmitMode(_settings.Voice.TalkMode.ChatWindowSubmitMode)),
new("Voice toasts", _settings.Voice.ShowConversationToasts ? "Enabled" : "Disabled")
};
RecentItemsControl.ItemsSource = new List<DetailRow>
{
new("Last utterance", FormatTimestamp(running.LastUtteranceUtc)),
- new("Last wake", FormatTimestamp(running.LastWakeWordUtc)),
+ new("Last wake", FormatTimestamp(running.LastVoiceWakeUtc)),
new("Last issue", string.IsNullOrWhiteSpace(running.LastError) ? "None" : running.LastError!)
};
diff --git a/tests/OpenClaw.Shared.Tests/CapabilityTests.cs b/tests/OpenClaw.Shared.Tests/CapabilityTests.cs
index edcd834..89158ea 100644
--- a/tests/OpenClaw.Shared.Tests/CapabilityTests.cs
+++ b/tests/OpenClaw.Shared.Tests/CapabilityTests.cs
@@ -990,7 +990,7 @@ public async Task GetSettings_ReturnsSettingsFromHandler()
cap.SettingsRequested += () => Task.FromResult(new VoiceSettings
{
Enabled = true,
- Mode = VoiceActivationMode.WakeWord
+ Mode = VoiceActivationMode.VoiceWake
});
var res = await cap.ExecuteAsync(new NodeInvokeRequest
@@ -1004,7 +1004,7 @@ public async Task GetSettings_ReturnsSettingsFromHandler()
var json = JsonSerializer.Serialize(res.Payload);
using var doc = JsonDocument.Parse(json);
Assert.True(doc.RootElement.GetProperty("Enabled").GetBoolean());
- Assert.Equal("WakeWord", doc.RootElement.GetProperty("Mode").GetString());
+ Assert.Equal("VoiceWake", doc.RootElement.GetProperty("Mode").GetString());
}
[Fact]
@@ -1022,13 +1022,13 @@ public async Task SetSettings_UsesUpdateEnvelope_WhenPresent()
{
Id = "voice3",
Command = VoiceCommands.SetSettings,
- Args = Parse("""{"update":{"persist":false,"settings":{"enabled":true,"mode":"AlwaysOn"}}}""")
+ Args = Parse("""{"update":{"persist":false,"settings":{"enabled":true,"mode":"TalkMode"}}}""")
});
Assert.True(res.Ok);
Assert.NotNull(received);
Assert.False(received!.Persist);
- Assert.Equal(VoiceActivationMode.AlwaysOn, received.Settings.Mode);
+ Assert.Equal(VoiceActivationMode.TalkMode, received.Settings.Mode);
}
[Fact]
@@ -1039,7 +1039,7 @@ public async Task GetStatus_ReturnsStatusFromHandler()
{
Available = true,
Running = true,
- Mode = VoiceActivationMode.AlwaysOn,
+ Mode = VoiceActivationMode.TalkMode,
State = VoiceRuntimeState.ListeningContinuously
});
@@ -1070,7 +1070,7 @@ public async Task Start_PassesArgsToHandler()
Available = true,
Running = true,
Mode = args.Mode ?? VoiceActivationMode.Off,
- State = VoiceRuntimeState.ListeningForWakeWord,
+ State = VoiceRuntimeState.ListeningForVoiceWake,
SessionKey = args.SessionKey
});
};
@@ -1079,12 +1079,12 @@ public async Task Start_PassesArgsToHandler()
{
Id = "voice5",
Command = VoiceCommands.Start,
- Args = Parse("""{"mode":"WakeWord","sessionKey":"session-123"}""")
+ Args = Parse("""{"mode":"VoiceWake","sessionKey":"session-123"}""")
});
Assert.True(res.Ok);
Assert.NotNull(received);
- Assert.Equal(VoiceActivationMode.WakeWord, received!.Mode);
+ Assert.Equal(VoiceActivationMode.VoiceWake, received!.Mode);
Assert.Equal("session-123", received.SessionKey);
}
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index e566f46..807ac61 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -36,11 +36,11 @@ public void VoiceSettings_Defaults_AreConcreteAndProviderAgnostic()
Assert.Equal(16000, settings.SampleRateHz);
Assert.Equal(80, settings.CaptureChunkMs);
Assert.True(settings.BargeInEnabled);
- Assert.Equal("NanoWakeWord", settings.WakeWord.Engine);
- Assert.Equal("hey_openclaw", settings.WakeWord.ModelId);
- Assert.Equal(0.65f, settings.WakeWord.TriggerThreshold);
- Assert.Equal(250, settings.AlwaysOn.MinSpeechMs);
- Assert.Equal(VoiceChatWindowSubmitMode.AutoSend, settings.AlwaysOn.ChatWindowSubmitMode);
+ Assert.Equal("NanoWakeWord", settings.VoiceWake.Engine);
+ Assert.Equal("hey_openclaw", settings.VoiceWake.ModelId);
+ Assert.Equal(0.65f, settings.VoiceWake.TriggerThreshold);
+ Assert.Equal(250, settings.TalkMode.MinSpeechMs);
+ Assert.Equal(VoiceChatWindowSubmitMode.AutoSend, settings.TalkMode.ChatWindowSubmitMode);
}
[Fact]
@@ -52,7 +52,7 @@ public void VoiceStatusInfo_Defaults_ToStopped()
Assert.False(status.Running);
Assert.Equal(VoiceActivationMode.Off, status.Mode);
Assert.Equal(VoiceRuntimeState.Stopped, status.State);
- Assert.False(status.WakeWordLoaded);
+ Assert.False(status.VoiceWakeLoaded);
Assert.Null(status.LastError);
}
@@ -61,10 +61,10 @@ public void VoiceEnums_Serialize_AsStrings()
{
var json = JsonSerializer.Serialize(new VoiceStartArgs
{
- Mode = VoiceActivationMode.WakeWord
+ Mode = VoiceActivationMode.VoiceWake
});
- Assert.Contains("\"WakeWord\"", json);
+ Assert.Contains("\"VoiceWake\"", json);
}
[Fact]
diff --git a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
index 65062fa..86e576f 100644
--- a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
+++ b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
@@ -32,7 +32,7 @@ public void RoundTrip_AllFields_Preserved()
Voice = new VoiceSettings
{
Enabled = true,
- Mode = VoiceActivationMode.WakeWord,
+ Mode = VoiceActivationMode.VoiceWake,
ShowConversationToasts = true,
SpeechToTextProviderId = "windows",
TextToSpeechProviderId = "elevenlabs",
@@ -41,7 +41,7 @@ public void RoundTrip_AllFields_Preserved()
SampleRateHz = 16000,
CaptureChunkMs = 80,
BargeInEnabled = false,
- WakeWord = new VoiceWakeWordSettings
+ VoiceWake = new VoiceWakeSettings
{
Engine = "NanoWakeWord",
ModelId = "hey_openclaw",
@@ -50,7 +50,7 @@ public void RoundTrip_AllFields_Preserved()
PreRollMs = 1400,
EndSilenceMs = 1000
},
- AlwaysOn = new VoiceAlwaysOnSettings
+ TalkMode = new TalkModeSettings
{
MinSpeechMs = 300,
EndSilenceMs = 1100,
@@ -114,17 +114,17 @@ public void RoundTrip_AllFields_Preserved()
Assert.Equal(original.PreferStructuredCategories, restored.PreferStructuredCategories);
Assert.NotNull(restored.Voice);
Assert.True(restored.Voice.Enabled);
- Assert.Equal(VoiceActivationMode.WakeWord, restored.Voice.Mode);
+ Assert.Equal(VoiceActivationMode.VoiceWake, restored.Voice.Mode);
Assert.True(restored.Voice.ShowConversationToasts);
Assert.Equal("windows", restored.Voice.SpeechToTextProviderId);
Assert.Equal("elevenlabs", restored.Voice.TextToSpeechProviderId);
Assert.Equal("mic-1", restored.Voice.InputDeviceId);
Assert.Equal("spk-2", restored.Voice.OutputDeviceId);
- Assert.Equal("NanoWakeWord", restored.Voice.WakeWord.Engine);
- Assert.Equal("hey_openclaw", restored.Voice.WakeWord.ModelId);
- Assert.Equal(0.72f, restored.Voice.WakeWord.TriggerThreshold);
- Assert.Equal(300, restored.Voice.AlwaysOn.MinSpeechMs);
- Assert.Equal(VoiceChatWindowSubmitMode.WaitForUser, restored.Voice.AlwaysOn.ChatWindowSubmitMode);
+ Assert.Equal("NanoWakeWord", restored.Voice.VoiceWake.Engine);
+ Assert.Equal("hey_openclaw", restored.Voice.VoiceWake.ModelId);
+ Assert.Equal(0.72f, restored.Voice.VoiceWake.TriggerThreshold);
+ Assert.Equal(300, restored.Voice.TalkMode.MinSpeechMs);
+ Assert.Equal(VoiceChatWindowSubmitMode.WaitForUser, restored.Voice.TalkMode.ChatWindowSubmitMode);
Assert.NotNull(restored.VoiceProviderConfiguration);
Assert.Equal("minimax-key", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.ApiKey));
Assert.Equal("speech-2.8-turbo", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.Model));
@@ -186,7 +186,7 @@ public void MissingFields_UseDefaults()
Assert.NotNull(settings.VoiceProviderConfiguration);
Assert.Empty(settings.VoiceProviderConfiguration.Providers);
Assert.Equal(16000, settings.Voice.SampleRateHz);
- Assert.Equal("NanoWakeWord", settings.Voice.WakeWord.Engine);
+ Assert.Equal("NanoWakeWord", settings.Voice.VoiceWake.Engine);
Assert.Null(settings.UserRules);
}
From 47efc3e741508275a9f872a78e8f5348e4a8e56b Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 20:27:07 +0000
Subject: [PATCH 19/83] Move voice settings below node mode toggle
---
src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml
index a51f028..2b8699d 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml
+++ b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml
@@ -85,8 +85,6 @@
<Button x:Name="TestNotificationButton" x:Uid="SettingsTestNotificationButton" Content="Send Test Notification"
Click="OnTestNotification"/>
- <controls:VoiceSettingsPanel x:Name="VoiceSettingsPanel"/>
-
<!-- Advanced Section -->
<StackPanel Spacing="8">
<TextBlock x:Uid="SettingsAdvancedHeader" Text="ADVANCED (EXPERIMENTAL)" Style="{StaticResource CaptionTextBlockStyle}"
@@ -98,6 +96,8 @@
Foreground="{ThemeResource TextFillColorSecondaryBrush}"
TextWrapping="Wrap"
Margin="0,-4,0,0"/>
+
+ <controls:VoiceSettingsPanel x:Name="VoiceSettingsPanel"/>
</StackPanel>
</StackPanel>
From 85d7b906f1832b7512c3405f4765e6ad72789519 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 21:16:27 +0000
Subject: [PATCH 20/83] Make cloud TTS voice settings fully catalog-driven
---
docs/VOICE-MODE.md | 71 ++++++++++++++---
src/OpenClaw.Shared/VoiceModeSchema.cs | 4 +
.../Controls/VoiceSettingsPanel.xaml | 12 +++
.../Controls/VoiceSettingsPanel.xaml.cs | 62 +++++++++++++--
.../Voice/VoiceCloudTextToSpeechClient.cs | 76 +++++++++++++++----
.../Voice/VoiceProviderCatalogService.cs | 62 ++++++++++++---
.../VoiceModeSchemaTests.cs | 1 +
.../SettingsRoundTripTests.cs | 4 +-
.../VoiceProviderCatalogServiceTests.cs | 21 ++++-
9 files changed, 266 insertions(+), 47 deletions(-)
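
Reviewer note (not part of the commit): the VOICE-MODE.md text in this patch says a setting marked as JSON is inserted into the request template as a raw fragment rather than a quoted string. The sketch below illustrates that rule under stated assumptions. `TemplateSketch`, `Render`, and the `jsonValueKeys` parameter are hypothetical names chosen for illustration and are not taken from VoiceCloudTextToSpeechClient.cs; only the `{{name}}` placeholder syntax and the ElevenLabs template string come from the patch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper for illustration only; the real client may differ.
public static class TemplateSketch
{
    public static string Render(
        string template,
        IReadOnlyDictionary<string, string> values,
        ISet<string> jsonValueKeys)
    {
        var result = template;
        foreach (var (key, value) in values)
        {
            // Settings marked as JSON are inserted verbatim; all other values
            // are serialized so they arrive as quoted, escaped JSON strings.
            var replacement = jsonValueKeys.Contains(key)
                ? value
                : JsonSerializer.Serialize(value);
            result = result.Replace("{{" + key + "}}", replacement);
        }
        return result;
    }

    public static void Main()
    {
        // Template copied from the ElevenLabs catalog entry in this patch.
        var template = "{ \"text\": {{text}}, \"model_id\": {{model}}, {{voiceSettingsJson}} }";
        var values = new Dictionary<string, string>
        {
            ["text"] = "Hello \"world\"",
            ["model"] = "eleven_multilingual_v2",
            ["voiceSettingsJson"] = "\"voice_settings\": { \"stability\": 0.5, \"similarity_boost\": 0.8 }"
        };

        var body = Render(template, values, new HashSet<string> { "voiceSettingsJson" });

        // JsonDocument.Parse throws JsonException if the rendered body is not valid JSON.
        using var doc = JsonDocument.Parse(body);
        Console.WriteLine(doc.RootElement.GetProperty("voice_settings").GetProperty("stability").GetDouble());
    }
}
```

Because the keyed fragment is pasted in as-is, a malformed `voiceSettingsJson` value would produce an invalid request body, so parsing the rendered body before sending (as `Main` does here) is one possible safeguard. The sequential `Replace` calls are a simplification: a value that itself contains a `{{...}}` token could be substituted a second time.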
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index e7c7154..21b960a 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -167,14 +167,33 @@ Example:
},
{
"id": "minimax",
- "name": "MiniMax Speech 2.8 Turbo",
+ "name": "MiniMax",
"runtime": "cloud",
"enabled": true,
- "description": "Cloud TTS using MiniMax HTTP text-to-speech.",
+ "description": "Cloud TTS using the MiniMax HTTP text-to-speech API.",
"settings": [
{ "key": "apiKey", "label": "API key", "secret": true },
- { "key": "model", "label": "Model", "defaultValue": "speech-2.8-turbo" },
- { "key": "voiceId", "label": "Voice ID", "defaultValue": "English_MatureBoss" }
+ {
+ "key": "model",
+ "label": "Model",
+ "defaultValue": "speech-2.8-turbo",
+ "options": [
+ "speech-2.5-turbo-preview",
+ "speech-02-turbo",
+ "speech-02-hd",
+ "speech-2.6-turbo",
+ "speech-2.6-hd",
+ "speech-2.8-turbo",
+ "speech-2.8-hd"
+ ]
+ },
+ { "key": "voiceId", "label": "Voice ID", "defaultValue": "English_MatureBoss" },
+ {
+ "key": "voiceSettingsJson",
+ "label": "Voice settings JSON",
+ "defaultValue": "\"voice_setting\": { \"voice_id\": {{voiceId}}, \"speed\": 1, \"vol\": 1, \"pitch\": 0 }",
+ "placeholder": "\"voice_setting\": { \"voice_id\": \"English_MatureBoss\", \"speed\": 1, \"vol\": 1, \"pitch\": 0 }"
+ }
],
"textToSpeechHttp": {
"endpointTemplate": "https://api.minimax.io/v1/t2a_v2",
@@ -183,7 +202,7 @@ Example:
"authenticationScheme": "Bearer",
"apiKeySettingKey": "apiKey",
"requestContentType": "application/json",
- "requestBodyTemplate": "{ \"model\": {{model}}, \"text\": {{text}}, \"stream\": false, \"language_boost\": \"English\", \"output_format\": \"hex\", \"voice_setting\": { \"voice_id\": {{voiceId}}, \"speed\": 1, \"vol\": 1, \"pitch\": 0 }, \"audio_setting\": { \"sample_rate\": 32000, \"bitrate\": 128000, \"format\": \"mp3\", \"channel\": 1 } }",
+ "requestBodyTemplate": "{ \"model\": {{model}}, \"text\": {{text}}, \"stream\": false, \"language_boost\": \"English\", \"output_format\": \"hex\", {{voiceSettingsJson}}, \"audio_setting\": { \"sample_rate\": 32000, \"bitrate\": 128000, \"format\": \"mp3\", \"channel\": 1 } }",
"responseAudioMode": "hexJsonString",
"responseAudioJsonPath": "data.audio",
"responseStatusCodeJsonPath": "base_resp.status_code",
@@ -200,8 +219,24 @@ Example:
"description": "Cloud TTS using the ElevenLabs create speech API.",
"settings": [
{ "key": "apiKey", "label": "API key", "secret": true },
- { "key": "model", "label": "Model", "defaultValue": "eleven_multilingual_v2" },
- { "key": "voiceId", "label": "Voice ID", "placeholder": "Enter an ElevenLabs voice ID" }
+ {
+ "key": "model",
+ "label": "Model",
+ "defaultValue": "eleven_multilingual_v2",
+ "options": [
+ "eleven_flash_v2_5",
+ "eleven_turbo_v2_5",
+ "eleven_multilingual_v2",
+ "eleven_monolingual_v1"
+ ]
+ },
+ { "key": "voiceId", "label": "Voice ID", "placeholder": "Enter an ElevenLabs voice ID" },
+ {
+ "key": "voiceSettingsJson",
+ "label": "Voice settings JSON",
+ "defaultValue": "\"voice_settings\": null",
+ "placeholder": "\"voice_settings\": { \"stability\": 0.5, \"similarity_boost\": 0.8 }"
+ }
],
"textToSpeechHttp": {
"endpointTemplate": "https://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}?output_format=mp3_44100_128",
@@ -209,7 +244,7 @@ Example:
"authenticationHeaderName": "xi-api-key",
"apiKeySettingKey": "apiKey",
"requestContentType": "application/json",
- "requestBodyTemplate": "{ \"text\": {{text}}, \"model_id\": {{model}} }",
+ "requestBodyTemplate": "{ \"text\": {{text}}, \"model_id\": {{model}}, {{voiceSettingsJson}} }",
"responseAudioMode": "binary",
"outputContentType": "audio/mpeg"
}
@@ -238,12 +273,23 @@ Current configuration values are keyed by provider id. The built-in providers us
- `apiKey`
- `model`
- `voiceId`
+- `voiceSettingsJson`
-When the selected TTS provider in the Voice Mode window is not `windows`, the tray app shows provider-specific fields in the configuration form so the user can enter or edit:
+When the selected TTS provider in Settings is not `windows`, the tray app shows provider-specific fields in the configuration form so the user can enter or edit:
- API key
- model
- voice id
+- voice settings JSON
+
+If a provider setting definition includes an `options` list, the settings UI renders that setting as a drop-down instead of a free-text field. That is how built-in cloud providers expose a provider-level choice plus a separate model choice without recompilation.
+
+If a provider setting definition is marked as JSON, the value is inserted into the provider request template as a raw JSON fragment rather than a quoted string. That allows the provider catalog to define whether the user is entering:
+
+- a bare object
+- or a full keyed fragment such as `"voice_setting": { ... }`
+
+without hard-coding provider-specific wrapper keys into the runtime.
For `VoiceWake`, trigger words are gateway-owned global state. The Windows node should eventually consume the same shared trigger list and keep only a local enabled/disabled toggle plus device/runtime settings.
@@ -319,7 +365,8 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
"Values": {
"apiKey": "<local secret>",
"model": "speech-2.8-turbo",
- "voiceId": "English_MatureBoss"
+ "voiceId": "English_MatureBoss",
+ "voiceSettingsJson": "\"voice_setting\": { \"voice_id\": \"English_MatureBoss\", \"speed\": 1, \"vol\": 1, \"pitch\": 0 }"
}
},
{
@@ -327,7 +374,8 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
"Values": {
"apiKey": "<local secret>",
"model": "eleven_multilingual_v2",
- "voiceId": "voice-id"
+ "voiceId": "voice-id",
+ "voiceSettingsJson": "\"voice_settings\": { \"stability\": 0.5, \"similarity_boost\": 0.8 }"
}
}
]
@@ -377,6 +425,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
| `VoiceProviderConfiguration.Providers[].Values["apiKey"]` | string? | `null` | cloud providers | API key sent using the provider contract's configured auth header |
| `VoiceProviderConfiguration.Providers[].Values["model"]` | string? | provider default | cloud providers | Model identifier inserted into the configured request template |
| `VoiceProviderConfiguration.Providers[].Values["voiceId"]` | string? | provider default | cloud providers | Voice id inserted into the configured request template or URL |
+| `VoiceProviderConfiguration.Providers[].Values["voiceSettingsJson"]` | string? | provider default | cloud providers | Raw JSON fragment inserted into the configured request template; may be a keyed fragment like `"voice_setting": { ... }` |
At runtime today, those device ids are persisted and surfaced in the UI, but the v1 `TalkMode` path still uses the Windows system speech stack defaults for capture and playback.
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index d3d6d42..24dd672 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -147,6 +147,7 @@ public static class VoiceProviderSettingKeys
public const string ApiKey = "apiKey";
public const string Model = "model";
public const string VoiceId = "voiceId";
+ public const string VoiceSettingsJson = "voiceSettingsJson";
}
public static class VoiceTextToSpeechResponseModes
@@ -182,9 +183,12 @@ public sealed class VoiceProviderSettingDefinition
public string Key { get; set; } = "";
public string Label { get; set; } = "";
public bool Secret { get; set; }
+ public bool Required { get; set; } = true;
+ public bool JsonValue { get; set; }
public string? DefaultValue { get; set; }
public string? Placeholder { get; set; }
public string? Description { get; set; }
+ public List<string> Options { get; set; } = [];
}
public sealed class VoiceTextToSpeechHttpContract
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
index 9cd3c05..ad5e95d 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
@@ -31,12 +31,24 @@
<PasswordBox x:Name="VoiceTtsApiKeyPasswordBox"
Header="API key"
PasswordChanged="OnVoiceProviderSettingsChanged"/>
+ <ComboBox x:Name="VoiceTtsModelComboBox"
+ Header="Model"
+ Visibility="Collapsed"
+ SelectionChanged="OnVoiceProviderSettingsChanged"/>
<TextBox x:Name="VoiceTtsModelTextBox"
Header="Model"
+ Visibility="Collapsed"
TextChanged="OnVoiceProviderSettingsChanged"/>
<TextBox x:Name="VoiceTtsVoiceIdTextBox"
Header="Voice ID"
TextChanged="OnVoiceProviderSettingsChanged"/>
+ <TextBox x:Name="VoiceTtsVoiceSettingsJsonTextBox"
+ Header="Voice settings JSON"
+ Visibility="Collapsed"
+ AcceptsReturn="True"
+ TextWrapping="Wrap"
+ MinHeight="96"
+ TextChanged="OnVoiceProviderSettingsChanged"/>
</StackPanel>
<ComboBox x:Name="VoiceInputDeviceComboBox" Header="Listen device (microphone)" DisplayMemberPath="Name"/>
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
index 50f3eb9..70dfc46 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -21,6 +21,7 @@ public sealed partial class VoiceSettingsPanel : UserControl
private List<VoiceProviderOption> _textToSpeechOptions = new();
private List<DeviceOption> _inputOptions = new();
private List<DeviceOption> _outputOptions = new();
+ private List<string> _activeTtsModelOptions = new();
public VoiceSettingsPanel()
{
@@ -242,6 +243,8 @@ private void UpdateVoiceProviderSettingsEditor()
var apiKeySetting = FindSetting(provider, VoiceProviderSettingKeys.ApiKey);
var modelSetting = FindSetting(provider, VoiceProviderSettingKeys.Model);
var voiceIdSetting = FindSetting(provider, VoiceProviderSettingKeys.VoiceId);
+ var voiceSettingsJsonSetting = FindSetting(provider, VoiceProviderSettingKeys.VoiceSettingsJson);
+ var modelValue = GetProviderValue(providerId, modelSetting) ?? string.Empty;
_updatingVoiceProviderFields = true;
try
@@ -251,15 +254,48 @@ private void UpdateVoiceProviderSettingsEditor()
VoiceTtsApiKeyPasswordBox.Visibility = apiKeySetting != null ? Visibility.Visible : Visibility.Collapsed;
VoiceTtsApiKeyPasswordBox.Password = GetProviderValue(providerId, apiKeySetting) ?? string.Empty;
- VoiceTtsModelTextBox.Header = modelSetting?.Label ?? "Model";
- VoiceTtsModelTextBox.PlaceholderText = modelSetting?.Placeholder ?? string.Empty;
- VoiceTtsModelTextBox.Visibility = modelSetting != null ? Visibility.Visible : Visibility.Collapsed;
- VoiceTtsModelTextBox.Text = GetProviderValue(providerId, modelSetting) ?? string.Empty;
+ _activeTtsModelOptions = modelSetting?.Options
+ .Where(option => !string.IsNullOrWhiteSpace(option))
+ .Distinct(StringComparer.OrdinalIgnoreCase)
+ .ToList()
+ ?? [];
+
+ if (_activeTtsModelOptions.Count > 0)
+ {
+ if (!string.IsNullOrWhiteSpace(modelValue) &&
+ !_activeTtsModelOptions.Contains(modelValue, StringComparer.OrdinalIgnoreCase))
+ {
+ _activeTtsModelOptions.Insert(0, modelValue);
+ }
+
+ VoiceTtsModelComboBox.Header = modelSetting?.Label ?? "Model";
+ VoiceTtsModelComboBox.ItemsSource = _activeTtsModelOptions;
+ VoiceTtsModelComboBox.SelectedItem = _activeTtsModelOptions
+ .FirstOrDefault(option => string.Equals(option, modelValue, StringComparison.OrdinalIgnoreCase))
+ ?? _activeTtsModelOptions.FirstOrDefault();
+ VoiceTtsModelComboBox.Visibility = Visibility.Visible;
+ VoiceTtsModelTextBox.Visibility = Visibility.Collapsed;
+ }
+ else
+ {
+ VoiceTtsModelTextBox.Header = modelSetting?.Label ?? "Model";
+ VoiceTtsModelTextBox.PlaceholderText = modelSetting?.Placeholder ?? string.Empty;
+ VoiceTtsModelTextBox.Visibility = modelSetting != null ? Visibility.Visible : Visibility.Collapsed;
+ VoiceTtsModelTextBox.Text = modelValue;
+ VoiceTtsModelComboBox.ItemsSource = null;
+ VoiceTtsModelComboBox.SelectedItem = null;
+ VoiceTtsModelComboBox.Visibility = Visibility.Collapsed;
+ }
VoiceTtsVoiceIdTextBox.Header = voiceIdSetting?.Label ?? "Voice ID";
VoiceTtsVoiceIdTextBox.PlaceholderText = voiceIdSetting?.Placeholder ?? string.Empty;
VoiceTtsVoiceIdTextBox.Visibility = voiceIdSetting != null ? Visibility.Visible : Visibility.Collapsed;
VoiceTtsVoiceIdTextBox.Text = GetProviderValue(providerId, voiceIdSetting) ?? string.Empty;
+
+ VoiceTtsVoiceSettingsJsonTextBox.Header = voiceSettingsJsonSetting?.Label ?? "Voice settings JSON";
+ VoiceTtsVoiceSettingsJsonTextBox.PlaceholderText = voiceSettingsJsonSetting?.Placeholder ?? string.Empty;
+ VoiceTtsVoiceSettingsJsonTextBox.Visibility = voiceSettingsJsonSetting != null ? Visibility.Visible : Visibility.Collapsed;
+ VoiceTtsVoiceSettingsJsonTextBox.Text = GetProviderValue(providerId, voiceSettingsJsonSetting) ?? string.Empty;
_activeTtsProviderId = providerId;
}
finally
@@ -299,8 +335,9 @@ private void CaptureSelectedVoiceProviderSettings()
var provider = _textToSpeechOptions.FirstOrDefault(option =>
string.Equals(option.Id, providerId, StringComparison.OrdinalIgnoreCase));
SetProviderValue(providerId, FindSetting(provider, VoiceProviderSettingKeys.ApiKey), VoiceTtsApiKeyPasswordBox.Password);
- SetProviderValue(providerId, FindSetting(provider, VoiceProviderSettingKeys.Model), VoiceTtsModelTextBox.Text);
+ SetProviderValue(providerId, FindSetting(provider, VoiceProviderSettingKeys.Model), GetSelectedProviderModelValue());
SetProviderValue(providerId, FindSetting(provider, VoiceProviderSettingKeys.VoiceId), VoiceTtsVoiceIdTextBox.Text);
+ SetProviderValue(providerId, FindSetting(provider, VoiceProviderSettingKeys.VoiceSettingsJson), VoiceTtsVoiceSettingsJsonTextBox.Text);
}
private async void OnRefreshVoiceDevices(object sender, RoutedEventArgs e)
@@ -335,6 +372,16 @@ private void OnVoiceProviderSettingsChanged(object sender, RoutedEventArgs e)
return _voiceProviderConfigurationDraft.GetValue(providerId, setting.Key) ?? setting.DefaultValue;
}
+ private string? GetSelectedProviderModelValue()
+ {
+ if (VoiceTtsModelComboBox.Visibility == Visibility.Visible)
+ {
+ return VoiceTtsModelComboBox.SelectedItem?.ToString();
+ }
+
+ return VoiceTtsModelTextBox.Text;
+ }
+
private sealed record DeviceOption(string? DeviceId, string Name);
private void SetProviderValue(
@@ -376,7 +423,10 @@ private static VoiceProviderOption Clone(VoiceProviderOption source)
Secret = setting.Secret,
DefaultValue = setting.DefaultValue,
Placeholder = setting.Placeholder,
- Description = setting.Description
+ Description = setting.Description,
+ Required = setting.Required,
+ JsonValue = setting.JsonValue,
+ Options = setting.Options.ToList()
})
.ToList(),
TextToSpeechHttp = source.TextToSpeechHttp == null
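
Note on the model drop-down: the option handling in `UpdateVoiceProviderSettingsEditor` can be read as a small pure function, which may be easier to unit test than the WinUI control. The following is a minimal sketch, not the code in this patch; `ModelOptionsSketch` and `Build` are hypothetical names introduced for illustration, and an empty result is assumed to mean "fall back to the free-text box".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of the patch): isolates the option-list logic
// so it can be exercised without a ComboBox.
static class ModelOptionsSketch
{
    public static (List<string> Options, string? Selected) Build(
        IEnumerable<string>? catalogOptions,
        string? savedValue)
    {
        // Drop blank entries and case-insensitive duplicates, keeping first occurrences.
        var options = (catalogOptions ?? Enumerable.Empty<string>())
            .Where(option => !string.IsNullOrWhiteSpace(option))
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();

        // No options: the caller is assumed to show the free-text field instead.
        if (options.Count == 0)
        {
            return (options, null);
        }

        // Keep a previously saved value selectable even if the catalog no longer lists it.
        if (!string.IsNullOrWhiteSpace(savedValue) &&
            !options.Contains(savedValue, StringComparer.OrdinalIgnoreCase))
        {
            options.Insert(0, savedValue);
        }

        // Select the saved value when present, otherwise the first option.
        var selected = options.FirstOrDefault(option =>
                string.Equals(option, savedValue, StringComparison.OrdinalIgnoreCase))
            ?? options[0];

        return (options, selected);
    }
}
```

Inserting an unknown saved value at index 0 means a model removed from a later catalog is not silently replaced when the user reopens Settings.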
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
index 422d952..e3f6e81 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
@@ -63,15 +63,15 @@ public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
return await CreateResultAsync(audioBytesFromJson, contract.OutputContentType);
}
- private static Dictionary<string, string> BuildTemplateValues(
+ private static Dictionary<string, TemplateValue> BuildTemplateValues(
string text,
VoiceProviderOption provider,
VoiceProviderConfiguration? providerConfiguration,
VoiceTextToSpeechHttpContract contract)
{
- var values = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
+ var values = new Dictionary<string, TemplateValue>(StringComparer.OrdinalIgnoreCase)
{
- ["text"] = text
+ ["text"] = TemplateValue.FromString(text)
};
foreach (var setting in provider.Settings)
@@ -89,38 +89,45 @@ private static Dictionary<string, string> BuildTemplateValues(
$"{provider.Name} API key is not configured. Open Settings and complete the {provider.Name} voice provider fields.");
}
- throw new InvalidOperationException(
- $"{provider.Name} setting '{setting.Label}' is required. Open Settings and complete the {provider.Name} voice provider fields.");
+ if (setting.Required)
+ {
+ throw new InvalidOperationException(
+ $"{provider.Name} setting '{setting.Label}' is required. Open Settings and complete the {provider.Name} voice provider fields.");
+ }
+
+ continue;
}
- values[setting.Key] = effectiveValue;
+ values[setting.Key] = setting.JsonValue
+ ? TemplateValue.FromJson(effectiveValue, provider.Name, setting.Label, values)
+ : TemplateValue.FromString(effectiveValue);
}
return values;
}
- private static string ApplyUrlTemplate(string template, IReadOnlyDictionary<string, string> values)
+ private static string ApplyUrlTemplate(string template, IReadOnlyDictionary<string, TemplateValue> values)
{
var result = template;
foreach (var entry in values)
{
result = result.Replace(
"{{" + entry.Key + "}}",
- Uri.EscapeDataString(entry.Value),
+ Uri.EscapeDataString(entry.Value.Value),
StringComparison.Ordinal);
}
return result;
}
- private static string ApplyJsonTemplate(string template, IReadOnlyDictionary<string, string> values)
+ private static string ApplyJsonTemplate(string template, IReadOnlyDictionary<string, TemplateValue> values)
{
var result = template;
foreach (var entry in values)
{
result = result.Replace(
"{{" + entry.Key + "}}",
- JsonSerializer.Serialize(entry.Value),
+ entry.Value.JsonFragment ? entry.Value.Value : JsonSerializer.Serialize(entry.Value.Value),
StringComparison.Ordinal);
}
@@ -130,9 +137,9 @@ private static string ApplyJsonTemplate(string template, IReadOnlyDictionary<str
private static void ApplyAuthenticationHeader(
HttpRequestMessage request,
VoiceTextToSpeechHttpContract contract,
- IReadOnlyDictionary<string, string> values)
+ IReadOnlyDictionary<string, TemplateValue> values)
{
- if (!values.TryGetValue(contract.ApiKeySettingKey, out var apiKey) || string.IsNullOrWhiteSpace(apiKey))
+ if (!values.TryGetValue(contract.ApiKeySettingKey, out var apiKey) || string.IsNullOrWhiteSpace(apiKey.Value))
{
throw new InvalidOperationException("Voice provider API key is not configured.");
}
@@ -140,13 +147,13 @@ private static void ApplyAuthenticationHeader(
if (string.Equals(contract.AuthenticationHeaderName, "Authorization", StringComparison.OrdinalIgnoreCase) &&
!string.IsNullOrWhiteSpace(contract.AuthenticationScheme))
{
- request.Headers.Authorization = new AuthenticationHeaderValue(contract.AuthenticationScheme, apiKey);
+ request.Headers.Authorization = new AuthenticationHeaderValue(contract.AuthenticationScheme, apiKey.Value);
return;
}
var headerValue = string.IsNullOrWhiteSpace(contract.AuthenticationScheme)
- ? apiKey
- : $"{contract.AuthenticationScheme} {apiKey}";
+ ? apiKey.Value
+ : $"{contract.AuthenticationScheme} {apiKey.Value}";
request.Headers.TryAddWithoutValidation(contract.AuthenticationHeaderName, headerValue);
}
@@ -276,6 +283,45 @@ private static HttpClient CreateHttpClient()
Timeout = TimeSpan.FromSeconds(30)
};
}
+
+ private readonly record struct TemplateValue(string Value, bool JsonFragment)
+ {
+ public static TemplateValue FromString(string value) => new(value, false);
+
+ public static TemplateValue FromJson(
+ string json,
+ string providerName,
+ string label,
+ IReadOnlyDictionary<string, TemplateValue>? templateValues = null)
+ {
+ var substituted = templateValues == null
+ ? json
+ : ApplyJsonTemplate(json, templateValues);
+
+ try
+ {
+ using var document = JsonDocument.Parse(substituted);
+ return new(document.RootElement.GetRawText(), true);
+ }
+ catch (JsonException ex)
+ {
+ try
+ {
+ using var wrapped = JsonDocument.Parse("{ " + substituted + " }");
+ var wrappedJson = wrapped.RootElement.GetRawText();
+ return new(wrappedJson[1..^1], true);
+ }
+ catch (JsonException)
+ {
+ throw new InvalidOperationException(
+ $"{providerName} setting '{label}' must be valid JSON.",
+ ex);
+ }
+ }
+ }
+
+ public static implicit operator string(TemplateValue value) => value.Value;
+ }
}
public sealed class VoiceCloudTextToSpeechResult : IDisposable
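
Note on `TemplateValue.FromJson`: the method accepts either a complete JSON value or a keyed fragment such as `"voice_setting": { ... }` by parsing the input directly and, if that fails, wrapping it in braces, parsing again, and stripping the braces. A stand-alone sketch of that parse-then-wrap fallback is shown below. `JsonFragmentSketch` and `Normalize` are hypothetical names, the placeholder substitution step is omitted, and the error wrapping into `InvalidOperationException` is left out for brevity.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper (not part of the patch): shows only the validation fallback.
static class JsonFragmentSketch
{
    public static string Normalize(string fragment)
    {
        try
        {
            // Case 1: the input is already a complete JSON value, e.g. a bare object.
            using var document = JsonDocument.Parse(fragment);
            return document.RootElement.GetRawText();
        }
        catch (JsonException)
        {
            // Case 2: a keyed fragment such as "voice_settings": null.
            // Wrap it in braces so it parses as an object, then strip the braces again.
            // A JsonException from this second parse propagates to the caller.
            using var wrapped = JsonDocument.Parse("{ " + fragment + " }");
            var text = wrapped.RootElement.GetRawText();
            return text[1..^1];
        }
    }
}
```

For example, `JsonFragmentSketch.Normalize("\"voice_settings\": null")` returns the keyed fragment with surrounding whitespace, ready to be placed in member position of a request body. The built-in MiniMax and ElevenLabs templates place `{{voiceSettingsJson}}` in member position, so a bare object would presumably not yield a valid body there; the keyed form appears to be the intended input for those two providers.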
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
index 7e3136d..bdea563 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
@@ -118,9 +118,9 @@ private static VoiceProviderCatalog CreateBuiltInCatalog()
new VoiceProviderOption
{
Id = VoiceProviderIds.MiniMax,
- Name = "MiniMax Speech 2.8 Turbo",
+ Name = "MiniMax",
Runtime = "cloud",
- Description = "Cloud TTS using MiniMax HTTP text-to-speech.",
+ Description = "Cloud TTS using the MiniMax HTTP text-to-speech API.",
Settings =
[
new VoiceProviderSettingDefinition
@@ -133,13 +133,34 @@ private static VoiceProviderCatalog CreateBuiltInCatalog()
{
Key = VoiceProviderSettingKeys.Model,
Label = "Model",
- DefaultValue = "speech-2.8-turbo"
+ DefaultValue = "speech-2.8-turbo",
+ Options =
+ [
+ "speech-2.5-turbo-preview",
+ "speech-02-turbo",
+ "speech-02-hd",
+ "speech-2.6-turbo",
+ "speech-2.6-hd",
+ "speech-2.8-turbo",
+ "speech-2.8-hd"
+ ]
},
new VoiceProviderSettingDefinition
{
Key = VoiceProviderSettingKeys.VoiceId,
Label = "Voice ID",
+ Required = false,
DefaultValue = "English_MatureBoss"
+ },
+ new VoiceProviderSettingDefinition
+ {
+ Key = VoiceProviderSettingKeys.VoiceSettingsJson,
+ Label = "Voice settings JSON",
+ Required = false,
+ JsonValue = true,
+ DefaultValue = "\"voice_setting\": { \"voice_id\": {{voiceId}}, \"speed\": 1, \"vol\": 1, \"pitch\": 0 }",
+ Placeholder = "\"voice_setting\": { \"voice_id\": \"English_MatureBoss\", \"speed\": 1, \"vol\": 1, \"pitch\": 0 }",
+ Description = "Optional full MiniMax request fragment. If present, it controls the full voice_setting payload."
}
],
TextToSpeechHttp = new VoiceTextToSpeechHttpContract
@@ -156,12 +177,7 @@ private static VoiceProviderCatalog CreateBuiltInCatalog()
"stream": false,
"language_boost": "English",
"output_format": "hex",
- "voice_setting": {
- "voice_id": {{voiceId}},
- "speed": 1,
- "vol": 1,
- "pitch": 0
- },
+ {{voiceSettingsJson}},
"audio_setting": {
"sample_rate": 32000,
"bitrate": 128000,
@@ -196,13 +212,31 @@ private static VoiceProviderCatalog CreateBuiltInCatalog()
{
Key = VoiceProviderSettingKeys.Model,
Label = "Model",
- DefaultValue = "eleven_multilingual_v2"
+ DefaultValue = "eleven_multilingual_v2",
+ Options =
+ [
+ "eleven_flash_v2_5",
+ "eleven_turbo_v2_5",
+ "eleven_multilingual_v2",
+ "eleven_monolingual_v1"
+ ]
},
new VoiceProviderSettingDefinition
{
Key = VoiceProviderSettingKeys.VoiceId,
+ Required = false,
Label = "Voice ID",
Placeholder = "Enter an ElevenLabs voice ID"
+ },
+ new VoiceProviderSettingDefinition
+ {
+ Key = VoiceProviderSettingKeys.VoiceSettingsJson,
+ Label = "Voice settings JSON",
+ Required = false,
+ JsonValue = true,
+ DefaultValue = "\"voice_settings\": null",
+ Placeholder = "\"voice_settings\": { \"stability\": 0.5, \"similarity_boost\": 0.8 }",
+ Description = "Optional full ElevenLabs request fragment. If present, it controls the full voice_settings payload."
}
],
TextToSpeechHttp = new VoiceTextToSpeechHttpContract
@@ -215,7 +249,8 @@ private static VoiceProviderCatalog CreateBuiltInCatalog()
RequestBodyTemplate = """
{
"text": {{text}},
- "model_id": {{model}}
+ "model_id": {{model}},
+ {{voiceSettingsJson}}
}
""",
ResponseAudioMode = VoiceTextToSpeechResponseModes.Binary,
@@ -291,9 +326,12 @@ private static VoiceProviderSettingDefinition Clone(VoiceProviderSettingDefiniti
Key = source.Key,
Label = source.Label,
Secret = source.Secret,
+ Required = source.Required,
+ JsonValue = source.JsonValue,
DefaultValue = source.DefaultValue,
Placeholder = source.Placeholder,
- Description = source.Description
+ Description = source.Description,
+ Options = source.Options.ToList()
};
}
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index 807ac61..a140140 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -82,6 +82,7 @@ public void VoiceProviderIds_ExposeRequiredBuiltInProviders()
Assert.Equal("windows", VoiceProviderIds.Windows);
Assert.Equal("minimax", VoiceProviderIds.MiniMax);
Assert.Equal("elevenlabs", VoiceProviderIds.ElevenLabs);
+ Assert.Equal("voiceSettingsJson", VoiceProviderSettingKeys.VoiceSettingsJson);
}
[Fact]
diff --git a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
index 86e576f..1edf191 100644
--- a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
+++ b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
@@ -69,7 +69,8 @@ public void RoundTrip_AllFields_Preserved()
{
[VoiceProviderSettingKeys.ApiKey] = "minimax-key",
[VoiceProviderSettingKeys.Model] = "speech-2.8-turbo",
- [VoiceProviderSettingKeys.VoiceId] = "English_MatureBoss"
+ [VoiceProviderSettingKeys.VoiceId] = "English_MatureBoss",
+ [VoiceProviderSettingKeys.VoiceSettingsJson] = "{\"voice_id\":\"English_MatureBoss\",\"speed\":1.1}"
}
},
new VoiceProviderConfiguration
@@ -129,6 +130,7 @@ public void RoundTrip_AllFields_Preserved()
Assert.Equal("minimax-key", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.ApiKey));
Assert.Equal("speech-2.8-turbo", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.Model));
Assert.Equal("English_MatureBoss", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.VoiceId));
+ Assert.Equal("{\"voice_id\":\"English_MatureBoss\",\"speed\":1.1}", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.VoiceSettingsJson));
Assert.Equal("eleven-key", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.ApiKey));
Assert.Equal("eleven_multilingual_v2", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.Model));
Assert.Equal("voice-42", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.ElevenLabs, VoiceProviderSettingKeys.VoiceId));
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index 99fae22..c036611 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -30,20 +30,37 @@ public void LoadCatalog_ExposesBuiltInCloudTtsContracts()
var catalog = VoiceProviderCatalogService.LoadCatalog();
var minimax = Assert.Single(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.MiniMax);
+ Assert.Equal("MiniMax", minimax.Name);
Assert.NotNull(minimax.TextToSpeechHttp);
Assert.Equal("https://api.minimax.io/v1/t2a_v2", minimax.TextToSpeechHttp!.EndpointTemplate);
Assert.Equal("Authorization", minimax.TextToSpeechHttp.AuthenticationHeaderName);
Assert.Equal(VoiceTextToSpeechResponseModes.HexJsonString, minimax.TextToSpeechHttp.ResponseAudioMode);
- Assert.Equal("speech-2.8-turbo", minimax.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Model).DefaultValue);
+ var minimaxModelSetting = minimax.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Model);
+ Assert.Equal("speech-2.8-turbo", minimaxModelSetting.DefaultValue);
+ Assert.Contains("speech-2.8-turbo", minimaxModelSetting.Options);
+ Assert.Contains("speech-2.5-turbo-preview", minimaxModelSetting.Options);
Assert.Equal("English_MatureBoss", minimax.Settings.Single(s => s.Key == VoiceProviderSettingKeys.VoiceId).DefaultValue);
+ var minimaxVoiceSettingsJson = minimax.Settings.Single(s => s.Key == VoiceProviderSettingKeys.VoiceSettingsJson);
+ Assert.False(minimaxVoiceSettingsJson.Required);
+ Assert.True(minimaxVoiceSettingsJson.JsonValue);
+ Assert.Contains("\"voice_setting\":", minimaxVoiceSettingsJson.Placeholder);
+ Assert.Contains("{{voiceId}}", minimaxVoiceSettingsJson.DefaultValue);
var elevenLabs = Assert.Single(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.ElevenLabs);
+ Assert.Equal("ElevenLabs", elevenLabs.Name);
Assert.NotNull(elevenLabs.TextToSpeechHttp);
Assert.Equal(
"https://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}?output_format=mp3_44100_128",
elevenLabs.TextToSpeechHttp!.EndpointTemplate);
Assert.Equal("xi-api-key", elevenLabs.TextToSpeechHttp.AuthenticationHeaderName);
Assert.Equal(VoiceTextToSpeechResponseModes.Binary, elevenLabs.TextToSpeechHttp.ResponseAudioMode);
- Assert.Equal("eleven_multilingual_v2", elevenLabs.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Model).DefaultValue);
+ var elevenLabsModelSetting = elevenLabs.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Model);
+ Assert.Equal("eleven_multilingual_v2", elevenLabsModelSetting.DefaultValue);
+ Assert.Contains("eleven_flash_v2_5", elevenLabsModelSetting.Options);
+ Assert.Contains("eleven_turbo_v2_5", elevenLabsModelSetting.Options);
+ var elevenLabsVoiceSettingsJson = elevenLabs.Settings.Single(s => s.Key == VoiceProviderSettingKeys.VoiceSettingsJson);
+ Assert.False(elevenLabsVoiceSettingsJson.Required);
+ Assert.True(elevenLabsVoiceSettingsJson.JsonValue);
+ Assert.Equal("\"voice_settings\": null", elevenLabsVoiceSettingsJson.DefaultValue);
}
}
From c1cc0ffcfcad884e839d2532d16439485eced1e3 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 21:31:39 +0000
Subject: [PATCH 21/83] Ship voice provider catalog with the tray app
---
docs/VOICE-MODE.md | 14 +-
.../Assets/voice-providers.json | 128 +++++++++
.../Voice/VoiceProviderCatalogService.cs | 246 ++++--------------
.../VoiceProviderCatalogServiceTests.cs | 9 +
4 files changed, 188 insertions(+), 209 deletions(-)
create mode 100644 src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
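
Note on asset resolution: `CatalogFilePath` now delegates to `ResolveCatalogFilePath()` and combines it with the relative path `Assets\\voice-providers.json`. One plausible way to resolve a bundled asset is relative to the application base directory. The sketch below rests on that assumption and is not necessarily what this patch does; `CatalogPathSketch` and `Resolve` are hypothetical names, and the asset is assumed to be copied to the build output next to the executable.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of the patch): resolves a bundled asset
// relative to the directory the app was launched from.
static class CatalogPathSketch
{
    // Assumption: the catalog is copied to the build output under Assets\.
    public static string Resolve(string relativePath = "Assets\\voice-providers.json")
        => Path.Combine(AppContext.BaseDirectory, relativePath);
}
```

Because `LoadCatalog` now throws when the file is missing instead of falling back to a built-in catalog, packaging needs to include the asset in every output configuration for a lookup of this kind to succeed.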
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 21b960a..556451b 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -138,11 +138,11 @@ Runtime behavior in the current phase:
- non-Windows providers can be selected and persisted now
- unsupported providers fall back to Windows at runtime with a status warning
-### Local Provider Catalog
+### Provider Catalog
-Additional provider entries are supplied through a local catalog file:
+The provider catalog now ships with the tray app as a bundled asset:
-- `%APPDATA%\\OpenClawTray\\voice-providers.json`
+- `Assets\\voice-providers.json`
Example:
@@ -253,7 +253,7 @@ Example:
}
```
-For HTTP-backed TTS providers, the catalog carries the request/response contract. That allows a new provider to be added without recompilation, as long as it follows the same general HTTP template approach.
+For HTTP-backed TTS providers, the catalog carries the request/response contract. That allows a new provider to be added by shipping an updated catalog file with the app, as long as it follows the same general HTTP template approach.
This file defines provider metadata and HTTP contracts. It does not carry API keys.
@@ -264,7 +264,7 @@ That means the current design is:
- local tray settings choose the preferred STT/TTS provider ids
- provider API keys and editable values are stored in `%APPDATA%\\OpenClawTray\\settings.json` under `VoiceProviderConfiguration`
- OpenClaw remains the conversation endpoint for `chat.send`
-- the local provider catalog remains metadata-only and must not contain secrets
+- the shipped provider catalog remains metadata-only and must not contain secrets
This is an intentional short-term design choice so the Windows tray app can use cloud TTS providers without inventing a second catalog file for secrets. It can be revisited later if provider ownership is split differently.
@@ -421,7 +421,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
| `Voice.TalkMode.EndSilenceMs` | int | `900` | talk mode | Silence timeout used to finalize an utterance |
| `Voice.TalkMode.MaxUtteranceMs` | int | `15000` | talk mode | Hard cap on utterance length before forced submission/finalization |
| `Voice.TalkMode.ChatWindowSubmitMode` | enum | `AutoSend` | talk mode | When the tray chat window is open, either auto-send the finalized utterance or leave it in the compose box for manual send |
-| `VoiceProviderConfiguration.Providers[].ProviderId` | string | none | cloud providers | Provider id matching a `voice-providers.json` entry |
+| `VoiceProviderConfiguration.Providers[].ProviderId` | string | none | cloud providers | Provider id matching an `Assets\\voice-providers.json` entry |
| `VoiceProviderConfiguration.Providers[].Values["apiKey"]` | string? | `null` | cloud providers | API key sent using the provider contract's configured auth header |
| `VoiceProviderConfiguration.Providers[].Values["model"]` | string? | provider default | cloud providers | Model identifier inserted into the configured request template |
| `VoiceProviderConfiguration.Providers[].Values["voiceId"]` | string? | provider default | cloud providers | Voice id inserted into the configured request template or URL |
@@ -556,7 +556,7 @@ Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningForVoiceWake)
Provider support is now part of the Windows voice subsystem roadmap, not a hypothetical extension:
- `MiniMax` and `ElevenLabs` TTS are both expressed through built-in catalog contracts
-- additional HTTP TTS providers can be added through the local catalog without recompiling the tray app
+- additional HTTP TTS providers can be added by extending the shipped catalog without recompiling the tray app itself
- Windows STT remains the active speech-recognition baseline until a non-Windows STT provider is deliberately added
The Windows node still keeps provider choice bounded:
diff --git a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
new file mode 100644
index 0000000..e492e40
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
@@ -0,0 +1,128 @@
+{
+ "speechToTextProviders": [
+ {
+ "id": "windows",
+ "name": "Windows Speech Recognition",
+ "runtime": "windows",
+ "enabled": true,
+ "description": "Built-in Windows dictation and speech recognition."
+ }
+ ],
+ "textToSpeechProviders": [
+ {
+ "id": "windows",
+ "name": "Windows Speech Synthesis",
+ "runtime": "windows",
+ "enabled": true,
+ "description": "Built-in Windows text-to-speech playback."
+ },
+ {
+ "id": "minimax",
+ "name": "MiniMax",
+ "runtime": "cloud",
+ "enabled": true,
+ "description": "Cloud TTS using the MiniMax HTTP text-to-speech API.",
+ "settings": [
+ {
+ "key": "apiKey",
+ "label": "API key",
+ "secret": true
+ },
+ {
+ "key": "model",
+ "label": "Model",
+ "defaultValue": "speech-2.8-turbo",
+ "options": [
+ "speech-2.5-turbo-preview",
+ "speech-02-turbo",
+ "speech-02-hd",
+ "speech-2.6-turbo",
+ "speech-2.6-hd",
+ "speech-2.8-turbo",
+ "speech-2.8-hd"
+ ]
+ },
+ {
+ "key": "voiceId",
+ "label": "Voice ID",
+ "required": false,
+ "defaultValue": "English_MatureBoss"
+ },
+ {
+ "key": "voiceSettingsJson",
+ "label": "Voice settings JSON",
+ "required": false,
+ "jsonValue": true,
+ "defaultValue": "\"voice_setting\": { \"voice_id\": {{voiceId}}, \"speed\": 1, \"vol\": 1, \"pitch\": 0 }",
+ "placeholder": "\"voice_setting\": { \"voice_id\": \"English_MatureBoss\", \"speed\": 1, \"vol\": 1, \"pitch\": 0 }",
+ "description": "Optional full MiniMax request fragment. If present, it controls the full voice_setting payload."
+ }
+ ],
+ "textToSpeechHttp": {
+ "endpointTemplate": "https://api.minimax.io/v1/t2a_v2",
+ "httpMethod": "POST",
+ "authenticationHeaderName": "Authorization",
+ "authenticationScheme": "Bearer",
+ "apiKeySettingKey": "apiKey",
+ "requestContentType": "application/json",
+ "requestBodyTemplate": "{ \"model\": {{model}}, \"text\": {{text}}, \"stream\": false, \"language_boost\": \"English\", \"output_format\": \"hex\", {{voiceSettingsJson}}, \"audio_setting\": { \"sample_rate\": 32000, \"bitrate\": 128000, \"format\": \"mp3\", \"channel\": 1 } }",
+ "responseAudioMode": "hexJsonString",
+ "responseAudioJsonPath": "data.audio",
+ "responseStatusCodeJsonPath": "base_resp.status_code",
+ "responseStatusMessageJsonPath": "base_resp.status_msg",
+ "successStatusValue": "0",
+ "outputContentType": "audio/mpeg"
+ }
+ },
+ {
+ "id": "elevenlabs",
+ "name": "ElevenLabs",
+ "runtime": "cloud",
+ "enabled": true,
+ "description": "Cloud TTS using the ElevenLabs create speech API.",
+ "settings": [
+ {
+ "key": "apiKey",
+ "label": "API key",
+ "secret": true
+ },
+ {
+ "key": "model",
+ "label": "Model",
+ "defaultValue": "eleven_multilingual_v2",
+ "options": [
+ "eleven_flash_v2_5",
+ "eleven_turbo_v2_5",
+ "eleven_multilingual_v2",
+ "eleven_monolingual_v1"
+ ]
+ },
+ {
+ "key": "voiceId",
+ "label": "Voice ID",
+ "required": false,
+ "placeholder": "Enter an ElevenLabs voice ID"
+ },
+ {
+ "key": "voiceSettingsJson",
+ "label": "Voice settings JSON",
+ "required": false,
+ "jsonValue": true,
+ "defaultValue": "\"voice_settings\": null",
+ "placeholder": "\"voice_settings\": { \"stability\": 0.5, \"similarity_boost\": 0.8 }",
+ "description": "Optional full ElevenLabs request fragment. If present, it controls the full voice_settings payload."
+ }
+ ],
+ "textToSpeechHttp": {
+ "endpointTemplate": "https://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}?output_format=mp3_44100_128",
+ "httpMethod": "POST",
+ "authenticationHeaderName": "xi-api-key",
+ "apiKeySettingKey": "apiKey",
+ "requestContentType": "application/json",
+ "requestBodyTemplate": "{ \"text\": {{text}}, \"model_id\": {{model}}, {{voiceSettingsJson}} }",
+ "responseAudioMode": "binary",
+ "outputContentType": "audio/mpeg"
+ }
+ }
+ ]
+}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
index bdea563..6fa68fe 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
@@ -10,11 +10,7 @@ namespace OpenClawTray.Services.Voice;
public static class VoiceProviderCatalogService
{
private const long MaxCatalogBytes = 256 * 1024;
- private const int MaxProviderEntriesPerList = 64;
- private static readonly string s_catalogFilePath = Path.Combine(
- Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData),
- "OpenClawTray",
- "voice-providers.json");
+ private const string CatalogRelativePath = "Assets\\voice-providers.json";
private static readonly JsonSerializerOptions s_jsonOptions = new()
{
@@ -22,46 +18,40 @@ public static class VoiceProviderCatalogService
WriteIndented = true
};
- public static string CatalogFilePath => s_catalogFilePath;
+ public static string CatalogFilePath => ResolveCatalogFilePath();
public static VoiceProviderCatalog LoadCatalog(IOpenClawLogger? logger = null)
{
- var merged = CreateBuiltInCatalog();
+ var catalogFilePath = ResolveCatalogFilePath();
try
{
- if (!File.Exists(s_catalogFilePath))
+ if (!File.Exists(catalogFilePath))
{
- return merged;
+ throw new FileNotFoundException("Voice provider catalog asset not found.", catalogFilePath);
}
- var fileInfo = new FileInfo(s_catalogFilePath);
+ var fileInfo = new FileInfo(catalogFilePath);
if (fileInfo.Length > MaxCatalogBytes)
{
- logger?.Warn($"Voice provider catalog exceeds {MaxCatalogBytes} bytes and will be ignored.");
- return merged;
+ throw new InvalidOperationException($"Voice provider catalog exceeds {MaxCatalogBytes} bytes.");
}
- var json = File.ReadAllText(s_catalogFilePath);
- var configured = JsonSerializer.Deserialize<VoiceProviderCatalog>(json, s_jsonOptions);
- if (configured == null)
+ var json = File.ReadAllText(catalogFilePath);
+ var catalog = JsonSerializer.Deserialize<VoiceProviderCatalog>(json, s_jsonOptions);
+ if (catalog == null)
{
- return merged;
+ throw new InvalidOperationException("Voice provider catalog asset is empty or invalid.");
}
- merged.SpeechToTextProviders = MergeProviders(
- merged.SpeechToTextProviders,
- configured.SpeechToTextProviders);
- merged.TextToSpeechProviders = MergeProviders(
- merged.TextToSpeechProviders,
- configured.TextToSpeechProviders);
+ return NormalizeCatalog(catalog);
}
catch (Exception ex)
{
- logger?.Warn($"Failed to load voice provider catalog: {ex.Message}");
+ throw new InvalidOperationException(
+ $"Failed to load voice provider catalog from '{catalogFilePath}': {ex.Message}",
+ ex);
}
-
- return merged;
}
public static VoiceProviderOption ResolveSpeechToTextProvider(string? providerId, IOpenClawLogger? logger = null)
@@ -92,191 +82,20 @@ public static bool SupportsTextToSpeechRuntime(string? providerId)
return provider.TextToSpeechHttp != null;
}
- private static VoiceProviderCatalog CreateBuiltInCatalog()
+ private static VoiceProviderCatalog NormalizeCatalog(VoiceProviderCatalog catalog)
{
return new VoiceProviderCatalog
{
- SpeechToTextProviders =
- [
- new VoiceProviderOption
- {
- Id = VoiceProviderIds.Windows,
- Name = "Windows Speech Recognition",
- Runtime = "windows",
- Description = "Built-in Windows dictation and speech recognition."
- }
- ],
- TextToSpeechProviders =
- [
- new VoiceProviderOption
- {
- Id = VoiceProviderIds.Windows,
- Name = "Windows Speech Synthesis",
- Runtime = "windows",
- Description = "Built-in Windows text-to-speech playback."
- },
- new VoiceProviderOption
- {
- Id = VoiceProviderIds.MiniMax,
- Name = "MiniMax",
- Runtime = "cloud",
- Description = "Cloud TTS using the MiniMax HTTP text-to-speech API.",
- Settings =
- [
- new VoiceProviderSettingDefinition
- {
- Key = VoiceProviderSettingKeys.ApiKey,
- Label = "API key",
- Secret = true
- },
- new VoiceProviderSettingDefinition
- {
- Key = VoiceProviderSettingKeys.Model,
- Label = "Model",
- DefaultValue = "speech-2.8-turbo",
- Options =
- [
- "speech-2.5-turbo-preview",
- "speech-02-turbo",
- "speech-02-hd",
- "speech-2.6-turbo",
- "speech-2.6-hd",
- "speech-2.8-turbo",
- "speech-2.8-hd"
- ]
- },
- new VoiceProviderSettingDefinition
- {
- Key = VoiceProviderSettingKeys.VoiceId,
- Label = "Voice ID",
- Required = false,
- DefaultValue = "English_MatureBoss"
- },
- new VoiceProviderSettingDefinition
- {
- Key = VoiceProviderSettingKeys.VoiceSettingsJson,
- Label = "Voice settings JSON",
- Required = false,
- JsonValue = true,
- DefaultValue = "\"voice_setting\": { \"voice_id\": {{voiceId}}, \"speed\": 1, \"vol\": 1, \"pitch\": 0 }",
- Placeholder = "\"voice_setting\": { \"voice_id\": \"English_MatureBoss\", \"speed\": 1, \"vol\": 1, \"pitch\": 0 }",
- Description = "Optional full MiniMax request fragment. If present, it controls the full voice_setting payload."
- }
- ],
- TextToSpeechHttp = new VoiceTextToSpeechHttpContract
- {
- EndpointTemplate = "https://api.minimax.io/v1/t2a_v2",
- AuthenticationHeaderName = "Authorization",
- AuthenticationScheme = "Bearer",
- ApiKeySettingKey = VoiceProviderSettingKeys.ApiKey,
- RequestContentType = "application/json",
- RequestBodyTemplate = """
- {
- "model": {{model}},
- "text": {{text}},
- "stream": false,
- "language_boost": "English",
- "output_format": "hex",
- {{voiceSettingsJson}},
- "audio_setting": {
- "sample_rate": 32000,
- "bitrate": 128000,
- "format": "mp3",
- "channel": 1
- }
- }
- """,
- ResponseAudioMode = VoiceTextToSpeechResponseModes.HexJsonString,
- ResponseAudioJsonPath = "data.audio",
- ResponseStatusCodeJsonPath = "base_resp.status_code",
- ResponseStatusMessageJsonPath = "base_resp.status_msg",
- SuccessStatusValue = "0",
- OutputContentType = "audio/mpeg"
- }
- },
- new VoiceProviderOption
- {
- Id = VoiceProviderIds.ElevenLabs,
- Name = "ElevenLabs",
- Runtime = "cloud",
- Description = "Cloud TTS using the ElevenLabs create speech API.",
- Settings =
- [
- new VoiceProviderSettingDefinition
- {
- Key = VoiceProviderSettingKeys.ApiKey,
- Label = "API key",
- Secret = true
- },
- new VoiceProviderSettingDefinition
- {
- Key = VoiceProviderSettingKeys.Model,
- Label = "Model",
- DefaultValue = "eleven_multilingual_v2",
- Options =
- [
- "eleven_flash_v2_5",
- "eleven_turbo_v2_5",
- "eleven_multilingual_v2",
- "eleven_monolingual_v1"
- ]
- },
- new VoiceProviderSettingDefinition
- {
- Key = VoiceProviderSettingKeys.VoiceId,
- Required = false,
- Label = "Voice ID",
- Placeholder = "Enter an ElevenLabs voice ID"
- },
- new VoiceProviderSettingDefinition
- {
- Key = VoiceProviderSettingKeys.VoiceSettingsJson,
- Label = "Voice settings JSON",
- Required = false,
- JsonValue = true,
- DefaultValue = "\"voice_settings\": null",
- Placeholder = "\"voice_settings\": { \"stability\": 0.5, \"similarity_boost\": 0.8 }",
- Description = "Optional full ElevenLabs request fragment. If present, it controls the full voice_settings payload."
- }
- ],
- TextToSpeechHttp = new VoiceTextToSpeechHttpContract
- {
- EndpointTemplate = "https://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}?output_format=mp3_44100_128",
- AuthenticationHeaderName = "xi-api-key",
- AuthenticationScheme = null,
- ApiKeySettingKey = VoiceProviderSettingKeys.ApiKey,
- RequestContentType = "application/json",
- RequestBodyTemplate = """
- {
- "text": {{text}},
- "model_id": {{model}},
- {{voiceSettingsJson}}
- }
- """,
- ResponseAudioMode = VoiceTextToSpeechResponseModes.Binary,
- OutputContentType = "audio/mpeg"
- }
- }
- ]
+ SpeechToTextProviders = NormalizeProviders(catalog.SpeechToTextProviders),
+ TextToSpeechProviders = NormalizeProviders(catalog.TextToSpeechProviders)
};
}
- private static List<VoiceProviderOption> MergeProviders(
- List<VoiceProviderOption> builtIn,
- List<VoiceProviderOption> configured)
+ private static List<VoiceProviderOption> NormalizeProviders(List<VoiceProviderOption> providers)
{
- var merged = builtIn
- .Select(Clone)
- .ToDictionary(p => p.Id, StringComparer.OrdinalIgnoreCase);
-
- foreach (var provider in configured
+ return providers
.Where(p => !string.IsNullOrWhiteSpace(p.Id))
- .Take(MaxProviderEntriesPerList))
- {
- merged[provider.Id] = Clone(provider);
- }
-
- return merged.Values
+ .Select(Clone)
.Where(p => p.Enabled)
.OrderByDescending(p => string.Equals(p.Id, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase))
.ThenBy(p => p.Name, StringComparer.OrdinalIgnoreCase)
@@ -359,4 +178,27 @@ private static VoiceProviderSettingDefinition Clone(VoiceProviderSettingDefiniti
OutputContentType = source.OutputContentType
};
}
+
+ private static string ResolveCatalogFilePath()
+ {
+ var bundledPath = Path.Combine(AppContext.BaseDirectory, CatalogRelativePath);
+ if (File.Exists(bundledPath))
+ {
+ return bundledPath;
+ }
+
+ var current = new DirectoryInfo(AppContext.BaseDirectory);
+ while (current != null)
+ {
+ var sourcePath = Path.Combine(current.FullName, "src", "OpenClaw.Tray.WinUI", CatalogRelativePath);
+ if (File.Exists(sourcePath))
+ {
+ return sourcePath;
+ }
+
+ current = current.Parent;
+ }
+
+ return bundledPath;
+ }
}
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index c036611..c7ca7ed 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -1,3 +1,5 @@
+using System;
+using System.IO;
using OpenClaw.Shared;
using OpenClawTray.Services.Voice;
using System.Linq;
@@ -6,6 +8,13 @@ namespace OpenClaw.Tray.Tests;
public class VoiceProviderCatalogServiceTests
{
+ [Fact]
+ public void CatalogFilePath_ResolvesToExistingBundledAsset()
+ {
+ Assert.EndsWith("voice-providers.json", VoiceProviderCatalogService.CatalogFilePath, StringComparison.OrdinalIgnoreCase);
+ Assert.True(File.Exists(VoiceProviderCatalogService.CatalogFilePath));
+ }
+
[Fact]
public void LoadCatalog_IncludesBuiltInMiniMaxAndElevenLabsTtsProviders()
{
From 83f05ee7a092333b195760272295d76517e48d46 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 21:50:11 +0000
Subject: [PATCH 22/83] Instrument voice output latency and reduce TTS buffering
---
docs/VOICE-MODE.md | 19 ++++++++++++
.../Voice/VoiceCloudTextToSpeechClient.cs | 29 ++++++++++++++++---
.../Services/Voice/VoiceService.cs | 5 +++-
3 files changed, 48 insertions(+), 5 deletions(-)
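Reviewer note: the headers-versus-total measurement this patch adds can be tried in isolation. The following is a minimal sketch, not the patch's code: `FakeTtsHandler`, `LatencySketch`, and the `tts.example.invalid` URL are hypothetical names used only so the example runs offline. It assumes the same pattern as the diff below (a `Stopwatch` started before `SendAsync` with `HttpCompletionOption.ResponseHeadersRead`, then a streamed copy of the body).

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical offline stand-in for a TTS endpoint; not part of the patch.
sealed class FakeTtsHandler : HttpMessageHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        return Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new ByteArrayContent(new byte[4096])
        });
    }
}

static class LatencySketch
{
    public static async Task<(long HeadersMs, long TotalMs, long Bytes)> MeasureAsync(
        HttpClient client, Uri uri)
    {
        var stopwatch = Stopwatch.StartNew();
        using var request = new HttpRequestMessage(HttpMethod.Post, uri);

        // Against a real server this completes once the response headers arrive,
        // before the body has been fully received.
        using var response = await client.SendAsync(
            request, HttpCompletionOption.ResponseHeadersRead);
        var headersMs = stopwatch.ElapsedMilliseconds;

        // Copy the body as a stream instead of materializing one byte[] first.
        await using var body = await response.Content.ReadAsStreamAsync();
        using var buffer = new MemoryStream();
        await body.CopyToAsync(buffer);

        return (headersMs, stopwatch.ElapsedMilliseconds, buffer.Length);
    }

    public static async Task Main()
    {
        using var client = new HttpClient(new FakeTtsHandler());
        var (headersMs, totalMs, bytes) = await MeasureAsync(
            client, new Uri("https://tts.example.invalid/v1/speak"));
        Console.WriteLine($"TTS latency: headers={headersMs}ms total={totalMs}ms bytes={bytes}");
    }
}
```

The gap between the two timings approximates body transfer plus copy time, which is the portion that first-chunk playback streaming would eventually hide.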
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 556451b..d8ae4e8 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -59,6 +59,25 @@ That means the first Windows target is transcript transport, not raw audio uploa
The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection exists to carry assistant chat events for `TalkMode`, and to provide a fallback direct `chat.send` path when the tray chat window is not open.
+## Speech Output Latency
+
+Microsoft's Azure Speech SDK latency guidance is specifically about speech synthesis, not speech recognition, so it applies to Windows voice output rather than voice input. Source: [Lower speech synthesis latency using Speech SDK](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-lower-speech-synthesis-latency?pivots=programming-language-csharp).
+
+The current Windows implementation already follows the guidance where it maps cleanly:
+
+- the Windows `SpeechSynthesizer` is created once per `TalkMode` runtime and reused for subsequent replies
+- cloud TTS uses a shared static `HttpClient`, so HTTP/TLS connections can be reused across replies
+- cloud requests use `HttpCompletionOption.ResponseHeadersRead`, so `SendAsync` returns as soon as the response headers arrive instead of buffering the full body first
+- the tray app now logs per-reply synthesis timings for both Windows and cloud TTS paths so latency can be measured directly during testing
+
+The main remaining gap is streaming playback from the first audio chunk. The Azure guidance recommends chunked playback as soon as the first audio arrives, but the current Windows implementation still waits for a complete playable stream before starting output:
+
+- Windows `SpeechSynthesizer` is used through `SynthesizeTextToStreamAsync`, which returns a complete stream for playback
+- MiniMax currently returns audio inside a JSON body, so playback cannot begin until the full response is available
+- ElevenLabs is currently integrated through the non-streaming convert contract in the provider catalog
+
+So the current design minimizes avoidable setup and connection latency, but does not yet implement first-chunk playback streaming.
+
## Tray Chat Integration Decision
Voice mode and typed chat must remain part of the same user-visible conversation in the tray app. Creating a separate "voice session" would reduce implementation complexity, but it would make the chat experience harder to understand:
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
index e3f6e81..8830660 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
@@ -1,4 +1,5 @@
using System;
+using System.Diagnostics;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
@@ -18,7 +19,8 @@ public sealed class VoiceCloudTextToSpeechClient
public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
string text,
VoiceProviderOption provider,
- VoiceProviderConfigurationStore configurationStore)
+ VoiceProviderConfigurationStore configurationStore,
+ IOpenClawLogger? logger = null)
{
ArgumentException.ThrowIfNullOrWhiteSpace(text);
ArgumentNullException.ThrowIfNull(provider);
@@ -41,7 +43,9 @@ public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
string.IsNullOrWhiteSpace(contract.RequestContentType) ? "application/json" : contract.RequestContentType);
}
+ var stopwatch = Stopwatch.StartNew();
using var response = await s_httpClient.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
+ var headersElapsedMs = stopwatch.ElapsedMilliseconds;
if (!response.IsSuccessStatusCode)
{
throw new InvalidOperationException(
@@ -50,8 +54,10 @@ public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
if (string.Equals(contract.ResponseAudioMode, VoiceTextToSpeechResponseModes.Binary, StringComparison.OrdinalIgnoreCase))
{
- var audioBytes = await response.Content.ReadAsByteArrayAsync();
- return await CreateResultAsync(audioBytes, contract.OutputContentType);
+ await using var responseStream = await response.Content.ReadAsStreamAsync();
+ var result = await CreateResultAsync(responseStream, contract.OutputContentType);
+ logger?.Info($"{provider.Name} TTS latency: headers={headersElapsedMs}ms total={stopwatch.ElapsedMilliseconds}ms (binary)");
+ return result;
}
var responseText = await response.Content.ReadAsStringAsync();
@@ -60,7 +66,9 @@ public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
var audioString = GetRequiredJsonString(document.RootElement, contract.ResponseAudioJsonPath);
var audioBytesFromJson = DecodeAudioBytes(contract.ResponseAudioMode, audioString, provider.Name);
- return await CreateResultAsync(audioBytesFromJson, contract.OutputContentType);
+ var jsonResult = await CreateResultAsync(audioBytesFromJson, contract.OutputContentType);
+ logger?.Info($"{provider.Name} TTS latency: headers={headersElapsedMs}ms total={stopwatch.ElapsedMilliseconds}ms ({contract.ResponseAudioMode})");
+ return jsonResult;
}
private static Dictionary<string, TemplateValue> BuildTemplateValues(
@@ -276,6 +284,19 @@ private static async Task<VoiceCloudTextToSpeechResult> CreateResultAsync(byte[]
return new VoiceCloudTextToSpeechResult(stream, string.IsNullOrWhiteSpace(contentType) ? "audio/mpeg" : contentType);
}
+ private static async Task<VoiceCloudTextToSpeechResult> CreateResultAsync(Stream sourceStream, string contentType)
+ {
+ var stream = new InMemoryRandomAccessStream();
+ await using (var output = stream.AsStreamForWrite())
+ {
+ await sourceStream.CopyToAsync(output);
+ await output.FlushAsync();
+ }
+
+ stream.Seek(0);
+ return new VoiceCloudTextToSpeechResult(stream, string.IsNullOrWhiteSpace(contentType) ? "audio/mpeg" : contentType);
+ }
+
private static HttpClient CreateHttpClient()
{
return new HttpClient
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 57ee68b..1168852 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -3,6 +3,7 @@
using System.Globalization;
using System.Linq;
using System.Net.Http;
+using System.Diagnostics;
using System.Runtime.InteropServices.WindowsRuntime;
using System.Text.Json;
using System.Threading;
@@ -989,7 +990,7 @@ private async Task SpeakTextAsync(string text)
if (provider.TextToSpeechHttp != null)
{
- using var result = await _cloudTextToSpeechClient.SynthesizeAsync(text, provider, providerConfiguration);
+ using var result = await _cloudTextToSpeechClient.SynthesizeAsync(text, provider, providerConfiguration, _logger);
await PlayStreamAsync(player, result.Stream, result.ContentType);
return;
}
@@ -999,7 +1000,9 @@ private async Task SpeakTextAsync(string text)
throw new InvalidOperationException("Speech playback is not ready.");
}
+ var stopwatch = Stopwatch.StartNew();
using var stream = await synthesizer.SynthesizeTextToStreamAsync(text);
+ _logger.Info($"Windows TTS latency: total={stopwatch.ElapsedMilliseconds}ms");
await PlayStreamAsync(player, stream, stream.ContentType);
}
From d1374092d9d3b7d6537cae235d09b2dd05db6add Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 21:56:18 +0000
Subject: [PATCH 23/83] Tighten talk mode speech recognition filtering
---
docs/VOICE-MODE.md | 3 +++
src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs | 8 ++++++--
2 files changed, 9 insertions(+), 2 deletions(-)
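Reviewer note: the two filters this patch tightens (dropping `Low` and `Rejected` confidence results, and suppressing repeats within 750 ms) can be sketched as one small gate. This is an illustration under assumptions, not the `VoiceService` code: `Confidence` is a hypothetical stand-in for `SpeechRecognitionConfidence`, `TranscriptFilter` is an invented name, and the clock is passed in so the behavior is testable. Where exactly the real service records the last transcript is not visible in this patch, so the update step here is a guess.

```csharp
using System;

// Hypothetical stand-in for Windows.Media.SpeechRecognition.SpeechRecognitionConfidence.
enum Confidence { High, Medium, Low, Rejected }

sealed class TranscriptFilter
{
    private static readonly TimeSpan DuplicateWindow = TimeSpan.FromMilliseconds(750);
    private string? _lastTranscript;
    private DateTime _lastTranscriptUtc;

    // Returns true when the transcript should be submitted to chat.
    public bool ShouldSubmit(string text, Confidence confidence, DateTime nowUtc)
    {
        // Only Medium or High confidence results pass.
        if (confidence is Confidence.Low or Confidence.Rejected)
        {
            return false;
        }

        // Case-insensitive repeat inside the window is treated as a duplicate.
        if (string.Equals(text, _lastTranscript, StringComparison.OrdinalIgnoreCase) &&
            nowUtc - _lastTranscriptUtc < DuplicateWindow)
        {
            return false;
        }

        // Assumption: remember only transcripts that were actually accepted.
        _lastTranscript = text;
        _lastTranscriptUtc = nowUtc;
        return true;
    }
}

static class Program
{
    static void Main()
    {
        var filter = new TranscriptFilter();
        var start = new DateTime(2026, 3, 23, 21, 56, 0, DateTimeKind.Utc);
        Console.WriteLine(filter.ShouldSubmit("open the chat", Confidence.High, start));
        Console.WriteLine(filter.ShouldSubmit("Open the chat", Confidence.Medium, start.AddMilliseconds(500)));
        Console.WriteLine(filter.ShouldSubmit("open the chat", Confidence.High, start.AddMilliseconds(1300)));
    }
}
```

Shortening the window from 2 s to 750 ms trades some duplicate protection for the ability to deliberately repeat a short phrase.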
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index d8ae4e8..f30492c 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -50,11 +50,14 @@ The contracts and persisted settings now use `VoiceWake` and `TalkMode` as well.
- the node captures audio locally
- local speech recognition turns that audio into transcript text
+- interim hypotheses are surfaced live, but only final `Medium` or `High` confidence recognizer results are submitted
- if the tray chat window is open and ready, the final transcript is submitted through the tray chat window's own compose/send path
- otherwise, the transcript is sent to OpenClaw via direct `chat.send` on the main session
- OpenClaw returns the assistant reply as normal chat output
- the node performs local TTS playback of that reply
+To avoid obvious duplicate sends from the Windows recognizer, identical final transcripts (compared case-insensitively) are suppressed when they repeat within a 750 ms window.
+
That means the first Windows target is transcript transport, not raw audio upload. Streaming audio frames in or out of OpenClaw remains a future protocol extension, not part of this design.
The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection exists to carry assistant chat events for `TalkMode`, and to provide a fallback direct `chat.send` path when the tray chat window is not open.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 1168852..a8ede20 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -28,7 +28,7 @@ public sealed class VoiceService : IVoiceRuntime, IDisposable
private const int HResultSpeechPrivacyDeclined = unchecked((int)0x80045509);
private static readonly TimeSpan TransportConnectTimeout = TimeSpan.FromSeconds(10);
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
- private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromSeconds(2);
+ private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
private readonly IOpenClawLogger _logger;
private readonly SettingsManager _settings;
@@ -638,11 +638,14 @@ private async void OnSpeechResultGenerated(
}
if (result.Status != SpeechRecognitionResultStatus.Success ||
- result.Confidence == SpeechRecognitionConfidence.Rejected)
+ result.Confidence == SpeechRecognitionConfidence.Rejected ||
+ result.Confidence == SpeechRecognitionConfidence.Low)
{
+ _logger.Info($"Voice recognition ignored result with confidence {result.Confidence}: {text}");
return;
}
+ _logger.Info($"Voice recognition result ({result.Confidence}): {text}");
await HandleRecognizedTextAsync(text);
}
catch (Exception ex)
@@ -706,6 +709,7 @@ private async Task HandleRecognizedTextAsync(string text)
if (string.Equals(text, _lastTranscript, StringComparison.OrdinalIgnoreCase) &&
DateTime.UtcNow - _lastTranscriptUtc < DuplicateTranscriptWindow)
{
+ _logger.Info($"Voice recognition suppressed duplicate transcript: {text}");
return;
}
From 05d7bae8965daba12ca8019663cee071d4e38d36 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 22:36:02 +0000
Subject: [PATCH 24/83] Use MiniMax api-uw endpoint for lower TTS latency
---
src/OpenClaw.Tray.WinUI/Assets/voice-providers.json | 2 +-
tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
index e492e40..1df157b 100644
--- a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
+++ b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
@@ -59,7 +59,7 @@
}
],
"textToSpeechHttp": {
- "endpointTemplate": "https://api.minimax.io/v1/t2a_v2",
+ "endpointTemplate": "https://api-uw.minimax.io/v1/t2a_v2",
"httpMethod": "POST",
"authenticationHeaderName": "Authorization",
"authenticationScheme": "Bearer",
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index c7ca7ed..3bef979 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -41,7 +41,7 @@ public void LoadCatalog_ExposesBuiltInCloudTtsContracts()
var minimax = Assert.Single(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.MiniMax);
Assert.Equal("MiniMax", minimax.Name);
Assert.NotNull(minimax.TextToSpeechHttp);
- Assert.Equal("https://api.minimax.io/v1/t2a_v2", minimax.TextToSpeechHttp!.EndpointTemplate);
+ Assert.Equal("https://api-uw.minimax.io/v1/t2a_v2", minimax.TextToSpeechHttp!.EndpointTemplate);
Assert.Equal("Authorization", minimax.TextToSpeechHttp.AuthenticationHeaderName);
Assert.Equal(VoiceTextToSpeechResponseModes.HexJsonString, minimax.TextToSpeechHttp.ResponseAudioMode);
var minimaxModelSetting = minimax.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Model);
From 5efcebfe31082404ae3e772e6f53a813d52f88c9 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 22:47:04 +0000
Subject: [PATCH 25/83] Add catalog-driven MiniMax WebSocket TTS
---
docs/VOICE-MODE.md | 30 ++-
src/OpenClaw.Shared/VoiceModeSchema.cs | 22 ++
.../Assets/voice-providers.json | 14 +-
.../Controls/VoiceSettingsPanel.xaml.cs | 22 ++
.../Voice/VoiceCloudTextToSpeechClient.cs | 214 +++++++++++++++++-
.../Voice/VoiceProviderCatalogService.cs | 34 ++-
.../VoiceProviderCatalogServiceTests.cs | 10 +-
7 files changed, 324 insertions(+), 22 deletions(-)
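Reviewer note: the response side of the catalog contract introduced here (`responseAudioJsonPath` = `data.audio`, `responseAudioMode` = `hexJsonString`, `finalFlagJsonPath` = `is_final`, `taskFailedEventName` = `task_failed`) can be illustrated without a socket. The sketch below is not the client added by this patch: `WebSocketAudioSketch.CollectAudio` is a hypothetical helper, the JSON paths are hard-coded rather than read from the catalog, and the sample messages are made up to match the field names in the contract.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

static class WebSocketAudioSketch
{
    // Accumulates hex-encoded audio chunks until a message carries is_final = true.
    public static byte[] CollectAudio(IEnumerable<string> messages)
    {
        using var audio = new MemoryStream();
        foreach (var message in messages)
        {
            using var document = JsonDocument.Parse(message);
            var root = document.RootElement;

            if (root.TryGetProperty("event", out var eventName) &&
                eventName.ValueKind == JsonValueKind.String &&
                eventName.GetString() == "task_failed")
            {
                throw new InvalidOperationException("TTS task failed.");
            }

            // data.audio holds a hex string when the message carries audio.
            if (root.TryGetProperty("data", out var data) &&
                data.ValueKind == JsonValueKind.Object &&
                data.TryGetProperty("audio", out var hex) &&
                hex.ValueKind == JsonValueKind.String)
            {
                var chunk = Convert.FromHexString(hex.GetString()!);
                audio.Write(chunk, 0, chunk.Length);
            }

            if (root.TryGetProperty("is_final", out var isFinal) &&
                isFinal.ValueKind == JsonValueKind.True)
            {
                break;
            }
        }

        return audio.ToArray();
    }

    public static void Main()
    {
        var bytes = CollectAudio(new[]
        {
            "{ \"event\": \"task_started\" }",
            "{ \"data\": { \"audio\": \"4944\" }, \"is_final\": false }",
            "{ \"data\": { \"audio\": \"33\" }, \"is_final\": true }"
        });
        Console.WriteLine(Convert.ToHexString(bytes));
    }
}
```

Because each chunk is decodable on arrival, this message shape is what would make first-chunk playback possible later, even though the current player still waits for the complete stream.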
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index f30492c..172a5c0 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -76,7 +76,7 @@ The current Windows implementation already follows the guidance where it maps cl
The main remaining gap is streaming playback from the first audio chunk. The Azure guidance recommends chunked playback as soon as the first audio arrives, but the current Windows implementation still waits for a complete playable stream before starting output:
- Windows `SpeechSynthesizer` is used through `SynthesizeTextToStreamAsync`, which returns a complete stream for playback
-- MiniMax currently returns audio inside a JSON body, so playback cannot begin until the full response is available
+- MiniMax now uses the provider catalog's WebSocket TTS contract, but the current player still waits for a complete playable stream before output starts
- ElevenLabs is currently integrated through the non-streaming convert contract in the provider catalog
So the current design minimizes avoidable setup and connection latency, but does not yet implement first-chunk playback streaming.
@@ -156,6 +156,7 @@ Runtime behavior in the current phase:
- `windows` is implemented for both STT and TTS
- built-in catalog entries exist for both `minimax` and `elevenlabs` TTS
- `minimax` defaults to `speech-2.8-turbo` and `English_MatureBoss`
+- `minimax` now uses a catalog-driven WebSocket contract for synchronous TTS
- `elevenlabs` defaults to `eleven_multilingual_v2` and a user-supplied voice id
- non-Windows providers can be selected and persisted now
- unsupported providers fall back to Windows at runtime with a status warning
@@ -192,7 +193,7 @@ Example:
"name": "MiniMax",
"runtime": "cloud",
"enabled": true,
- "description": "Cloud TTS using the MiniMax HTTP text-to-speech API.",
+ "description": "Cloud TTS using the MiniMax WebSocket text-to-speech API.",
"settings": [
{ "key": "apiKey", "label": "API key", "secret": true },
{
@@ -217,18 +218,22 @@ Example:
"placeholder": "\"voice_setting\": { \"voice_id\": \"English_MatureBoss\", \"speed\": 1, \"vol\": 1, \"pitch\": 0 }"
}
],
- "textToSpeechHttp": {
- "endpointTemplate": "https://api.minimax.io/v1/t2a_v2",
- "httpMethod": "POST",
+ "textToSpeechWebSocket": {
+ "endpointTemplate": "wss://api.minimax.io/ws/v1/t2a_v2",
"authenticationHeaderName": "Authorization",
"authenticationScheme": "Bearer",
"apiKeySettingKey": "apiKey",
- "requestContentType": "application/json",
- "requestBodyTemplate": "{ \"model\": {{model}}, \"text\": {{text}}, \"stream\": false, \"language_boost\": \"English\", \"output_format\": \"hex\", {{voiceSettingsJson}}, \"audio_setting\": { \"sample_rate\": 32000, \"bitrate\": 128000, \"format\": \"mp3\", \"channel\": 1 } }",
+ "connectSuccessEventName": "connected_success",
+ "startMessageTemplate": "{ \"event\": \"task_start\", \"model\": {{model}}, \"language_boost\": \"English\", {{voiceSettingsJson}}, \"audio_setting\": { \"sample_rate\": 32000, \"bitrate\": 128000, \"format\": \"mp3\", \"channel\": 1 } }",
+ "startSuccessEventName": "task_started",
+ "continueMessageTemplate": "{ \"event\": \"task_continue\", \"text\": {{text}} }",
+ "finishMessageTemplate": "{ \"event\": \"task_finish\" }",
"responseAudioMode": "hexJsonString",
"responseAudioJsonPath": "data.audio",
"responseStatusCodeJsonPath": "base_resp.status_code",
"responseStatusMessageJsonPath": "base_resp.status_msg",
+ "finalFlagJsonPath": "is_final",
+ "taskFailedEventName": "task_failed",
"successStatusValue": "0",
"outputContentType": "audio/mpeg"
}
@@ -275,9 +280,9 @@ Example:
}
```
-For HTTP-backed TTS providers, the catalog carries the request/response contract. That allows a new provider to be added by shipping an updated catalog file with the app, as long as it follows the same general HTTP template approach.
+For cloud-backed TTS providers, the catalog carries either an HTTP or WebSocket request/response contract. That allows a new provider to be added by shipping an updated catalog file with the app, as long as it follows the same general templated transport approach.
-This file defines provider metadata and HTTP contracts. It does not carry API keys.
+This file defines provider metadata and transport contracts. It does not carry API keys.
### Local Provider Configuration
@@ -313,6 +318,11 @@ If a provider setting definition is marked as JSON, the value is inserted into t
without hard-coding provider-specific wrapper keys into the runtime.
+The current cloud TTS transports are:
+
+- `MiniMax`: catalog-driven WebSocket synthesis
+- `ElevenLabs`: catalog-driven HTTP synthesis
+
For `VoiceWake`, trigger words are gateway-owned global state. The Windows node should eventually consume the same shared trigger list and keep only a local enabled/disabled toggle plus device/runtime settings.
## Command Surface
@@ -578,7 +588,7 @@ Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningForVoiceWake)
Provider support is now part of the Windows voice subsystem roadmap, not a hypothetical extension:
- `MiniMax` and `ElevenLabs` TTS are both expressed through built-in catalog contracts
-- additional HTTP TTS providers can be added by extending the shipped catalog without recompiling the tray app itself
+- additional HTTP or WebSocket TTS providers can be added by extending the shipped catalog without recompiling the tray app itself
- Windows STT remains the active speech-recognition baseline until a non-Windows STT provider is deliberately added
The Windows node still keeps provider choice bounded:
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index 24dd672..da4f759 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -208,6 +208,27 @@ public sealed class VoiceTextToSpeechHttpContract
public string OutputContentType { get; set; } = "audio/mpeg";
}
+public sealed class VoiceTextToSpeechWebSocketContract
+{
+ public string EndpointTemplate { get; set; } = "";
+ public string AuthenticationHeaderName { get; set; } = "Authorization";
+ public string? AuthenticationScheme { get; set; } = "Bearer";
+ public string ApiKeySettingKey { get; set; } = VoiceProviderSettingKeys.ApiKey;
+ public string ConnectSuccessEventName { get; set; } = "connected_success";
+ public string StartMessageTemplate { get; set; } = "";
+ public string StartSuccessEventName { get; set; } = "task_started";
+ public string ContinueMessageTemplate { get; set; } = "";
+ public string FinishMessageTemplate { get; set; } = "{ \"event\": \"task_finish\" }";
+ public string ResponseAudioMode { get; set; } = VoiceTextToSpeechResponseModes.Binary;
+ public string? ResponseAudioJsonPath { get; set; } = "data.audio";
+ public string? ResponseStatusCodeJsonPath { get; set; } = "base_resp.status_code";
+ public string? ResponseStatusMessageJsonPath { get; set; } = "base_resp.status_msg";
+ public string? FinalFlagJsonPath { get; set; } = "is_final";
+ public string TaskFailedEventName { get; set; } = "task_failed";
+ public string? SuccessStatusValue { get; set; } = "0";
+ public string OutputContentType { get; set; } = "audio/mpeg";
+}
+
public sealed class VoiceProviderOption
{
public string Id { get; set; } = "";
@@ -217,6 +238,7 @@ public sealed class VoiceProviderOption
public string? Description { get; set; }
public List<VoiceProviderSettingDefinition> Settings { get; set; } = [];
public VoiceTextToSpeechHttpContract? TextToSpeechHttp { get; set; }
+ public VoiceTextToSpeechWebSocketContract? TextToSpeechWebSocket { get; set; }
}
public sealed class VoiceProviderCatalog
diff --git a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
index 1df157b..c0b86d4 100644
--- a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
+++ b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
@@ -58,18 +58,22 @@
"description": "Optional full MiniMax request fragment. If present, it controls the full voice_setting payload."
}
],
- "textToSpeechHttp": {
- "endpointTemplate": "https://api-uw.minimax.io/v1/t2a_v2",
- "httpMethod": "POST",
+ "textToSpeechWebSocket": {
+ "endpointTemplate": "wss://api.minimax.io/ws/v1/t2a_v2",
"authenticationHeaderName": "Authorization",
"authenticationScheme": "Bearer",
"apiKeySettingKey": "apiKey",
- "requestContentType": "application/json",
- "requestBodyTemplate": "{ \"model\": {{model}}, \"text\": {{text}}, \"stream\": false, \"language_boost\": \"English\", \"output_format\": \"hex\", {{voiceSettingsJson}}, \"audio_setting\": { \"sample_rate\": 32000, \"bitrate\": 128000, \"format\": \"mp3\", \"channel\": 1 } }",
+ "connectSuccessEventName": "connected_success",
+ "startMessageTemplate": "{ \"event\": \"task_start\", \"model\": {{model}}, \"language_boost\": \"English\", {{voiceSettingsJson}}, \"audio_setting\": { \"sample_rate\": 32000, \"bitrate\": 128000, \"format\": \"mp3\", \"channel\": 1 } }",
+ "startSuccessEventName": "task_started",
+ "continueMessageTemplate": "{ \"event\": \"task_continue\", \"text\": {{text}} }",
+ "finishMessageTemplate": "{ \"event\": \"task_finish\" }",
"responseAudioMode": "hexJsonString",
"responseAudioJsonPath": "data.audio",
"responseStatusCodeJsonPath": "base_resp.status_code",
"responseStatusMessageJsonPath": "base_resp.status_msg",
+ "finalFlagJsonPath": "is_final",
+ "taskFailedEventName": "task_failed",
"successStatusValue": "0",
"outputContentType": "audio/mpeg"
}
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
index 70dfc46..d16ea54 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -446,6 +446,28 @@ private static VoiceProviderOption Clone(VoiceProviderOption source)
ResponseStatusMessageJsonPath = source.TextToSpeechHttp.ResponseStatusMessageJsonPath,
SuccessStatusValue = source.TextToSpeechHttp.SuccessStatusValue,
OutputContentType = source.TextToSpeechHttp.OutputContentType
+ },
+ TextToSpeechWebSocket = source.TextToSpeechWebSocket == null
+ ? null
+ : new VoiceTextToSpeechWebSocketContract
+ {
+ EndpointTemplate = source.TextToSpeechWebSocket.EndpointTemplate,
+ AuthenticationHeaderName = source.TextToSpeechWebSocket.AuthenticationHeaderName,
+ AuthenticationScheme = source.TextToSpeechWebSocket.AuthenticationScheme,
+ ApiKeySettingKey = source.TextToSpeechWebSocket.ApiKeySettingKey,
+ ConnectSuccessEventName = source.TextToSpeechWebSocket.ConnectSuccessEventName,
+ StartMessageTemplate = source.TextToSpeechWebSocket.StartMessageTemplate,
+ StartSuccessEventName = source.TextToSpeechWebSocket.StartSuccessEventName,
+ ContinueMessageTemplate = source.TextToSpeechWebSocket.ContinueMessageTemplate,
+ FinishMessageTemplate = source.TextToSpeechWebSocket.FinishMessageTemplate,
+ ResponseAudioMode = source.TextToSpeechWebSocket.ResponseAudioMode,
+ ResponseAudioJsonPath = source.TextToSpeechWebSocket.ResponseAudioJsonPath,
+ ResponseStatusCodeJsonPath = source.TextToSpeechWebSocket.ResponseStatusCodeJsonPath,
+ ResponseStatusMessageJsonPath = source.TextToSpeechWebSocket.ResponseStatusMessageJsonPath,
+ FinalFlagJsonPath = source.TextToSpeechWebSocket.FinalFlagJsonPath,
+ TaskFailedEventName = source.TextToSpeechWebSocket.TaskFailedEventName,
+ SuccessStatusValue = source.TextToSpeechWebSocket.SuccessStatusValue,
+ OutputContentType = source.TextToSpeechWebSocket.OutputContentType
}
};
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
index 8830660..859b6c8 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
@@ -1,11 +1,14 @@
using System;
using System.Diagnostics;
using System.Collections.Generic;
+using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
+using System.Net.WebSockets;
using System.Runtime.InteropServices.WindowsRuntime;
using System.Text;
using System.Text.Json;
+using System.Threading;
using System.Threading.Tasks;
using OpenClaw.Shared;
using Windows.Storage.Streams;
@@ -26,6 +29,11 @@ public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
ArgumentNullException.ThrowIfNull(provider);
ArgumentNullException.ThrowIfNull(configurationStore);
+ if (provider.TextToSpeechWebSocket != null)
+ {
+ return await SynthesizeViaWebSocketAsync(text, provider, configurationStore, logger);
+ }
+
var contract = provider.TextToSpeechHttp
?? throw new InvalidOperationException($"TTS provider '{provider.Name}' does not expose an HTTP contract.");
var providerConfiguration = configurationStore.FindProvider(provider.Id);
@@ -71,11 +79,101 @@ public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
return jsonResult;
}
+ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAsync(
+ string text,
+ VoiceProviderOption provider,
+ VoiceProviderConfigurationStore configurationStore,
+ IOpenClawLogger? logger)
+ {
+ var contract = provider.TextToSpeechWebSocket
+ ?? throw new InvalidOperationException($"TTS provider '{provider.Name}' does not expose a WebSocket contract.");
+ var providerConfiguration = configurationStore.FindProvider(provider.Id);
+ var templateValues = BuildTemplateValues(text, provider, providerConfiguration, contract.ApiKeySettingKey);
+ var endpoint = ApplyUrlTemplate(contract.EndpointTemplate, templateValues);
+ using var socket = new ClientWebSocket();
+ ApplyAuthenticationHeader(socket.Options, contract, templateValues);
+
+ var stopwatch = Stopwatch.StartNew();
+ await socket.ConnectAsync(new Uri(endpoint), CancellationToken.None);
+ var connectedMessage = await ReceiveJsonMessageAsync(socket);
+ ValidateWebSocketEvent(provider.Name, contract.ConnectSuccessEventName, connectedMessage, contract);
+
+ var startMessage = ApplyJsonTemplate(contract.StartMessageTemplate, templateValues);
+ await SendTextMessageAsync(socket, startMessage);
+ var startedMessage = await ReceiveJsonMessageAsync(socket);
+ ValidateWebSocketEvent(provider.Name, contract.StartSuccessEventName, startedMessage, contract);
+
+ var continueMessage = ApplyJsonTemplate(contract.ContinueMessageTemplate, templateValues);
+ await SendTextMessageAsync(socket, continueMessage);
+
+ var audioBytes = new List<byte>();
+ long? firstChunkMs = null;
+
+ while (true)
+ {
+ var message = await ReceiveJsonMessageAsync(socket);
+ EnsureWebSocketNotFailed(provider.Name, contract, message);
+
+ if (TryGetJsonString(message, contract.ResponseAudioJsonPath, out var audioChunk) &&
+ !string.IsNullOrWhiteSpace(audioChunk))
+ {
+ if (!firstChunkMs.HasValue)
+ {
+ firstChunkMs = stopwatch.ElapsedMilliseconds;
+ }
+
+ audioBytes.AddRange(DecodeAudioBytes(contract.ResponseAudioMode, audioChunk, provider.Name));
+ }
+
+ if (IsFinalWebSocketMessage(message, contract))
+ {
+ break;
+ }
+ }
+
+ if (!string.IsNullOrWhiteSpace(contract.FinishMessageTemplate))
+ {
+ try
+ {
+ await SendTextMessageAsync(socket, ApplyJsonTemplate(contract.FinishMessageTemplate, templateValues));
+ }
+ catch // Best effort: the audio is already buffered, so a failed finish message is ignored.
+ {
+ }
+ }
+
+ try
+ {
+ await socket.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", CancellationToken.None);
+ }
+ catch // Best effort: a failed close handshake must not discard the synthesized audio.
+ {
+ }
+
+ if (audioBytes.Count == 0)
+ {
+ throw new InvalidOperationException($"{provider.Name} TTS did not return any audio data.");
+ }
+
+ var result = await CreateResultAsync(audioBytes.ToArray(), contract.OutputContentType);
+ logger?.Info($"{provider.Name} TTS latency: firstChunk={(firstChunkMs?.ToString() ?? "n/a")}ms total={stopwatch.ElapsedMilliseconds}ms (websocket)");
+ return result;
+ }
+
private static Dictionary<string, TemplateValue> BuildTemplateValues(
string text,
VoiceProviderOption provider,
VoiceProviderConfiguration? providerConfiguration,
VoiceTextToSpeechHttpContract contract)
+ {
+ return BuildTemplateValues(text, provider, providerConfiguration, contract.ApiKeySettingKey);
+ }
+
+ private static Dictionary<string, TemplateValue> BuildTemplateValues(
+ string text,
+ VoiceProviderOption provider,
+ VoiceProviderConfiguration? providerConfiguration,
+ string apiKeySettingKey)
{
var values = new Dictionary<string, TemplateValue>(StringComparer.OrdinalIgnoreCase)
{
@@ -91,7 +189,7 @@ private static Dictionary<string, TemplateValue> BuildTemplateValues(
if (string.IsNullOrWhiteSpace(effectiveValue))
{
- if (setting.Secret || string.Equals(setting.Key, contract.ApiKeySettingKey, StringComparison.OrdinalIgnoreCase))
+ if (setting.Secret || string.Equals(setting.Key, apiKeySettingKey, StringComparison.OrdinalIgnoreCase))
{
throw new InvalidOperationException(
$"{provider.Name} API key is not configured. Open Settings and complete the {provider.Name} voice provider fields.");
@@ -165,6 +263,26 @@ private static void ApplyAuthenticationHeader(
request.Headers.TryAddWithoutValidation(contract.AuthenticationHeaderName, headerValue);
}
+ private static void ApplyAuthenticationHeader(
+ ClientWebSocketOptions options,
+ VoiceTextToSpeechWebSocketContract contract,
+ IReadOnlyDictionary<string, TemplateValue> values)
+ {
+ if (!values.TryGetValue(contract.ApiKeySettingKey, out var apiKey) || string.IsNullOrWhiteSpace(apiKey.Value))
+ {
+ throw new InvalidOperationException("Voice provider API key is not configured.");
+ }
+
+ var headerValue = string.Equals(contract.AuthenticationHeaderName, "Authorization", StringComparison.OrdinalIgnoreCase) &&
+ !string.IsNullOrWhiteSpace(contract.AuthenticationScheme)
+ ? $"{contract.AuthenticationScheme} {apiKey.Value}"
+ : string.IsNullOrWhiteSpace(contract.AuthenticationScheme)
+ ? apiKey.Value
+ : $"{contract.AuthenticationScheme} {apiKey.Value}";
+
+ options.SetRequestHeader(contract.AuthenticationHeaderName, headerValue);
+ }
+
private static HttpMethod ParseHttpMethod(string? method)
{
if (string.Equals(method, HttpMethod.Post.Method, StringComparison.OrdinalIgnoreCase))
@@ -204,6 +322,42 @@ private static void ValidateResponseStatus(
: $"{provider.Name} TTS returned an error: {statusMessage}");
}
+ private static void ValidateWebSocketEvent(
+ string providerName,
+ string expectedEvent,
+ JsonElement message,
+ VoiceTextToSpeechWebSocketContract contract)
+ {
+ EnsureWebSocketNotFailed(providerName, contract, message);
+
+ if (!TryGetJsonString(message, "event", out var eventName) ||
+ !string.Equals(eventName, expectedEvent, StringComparison.OrdinalIgnoreCase))
+ {
+ throw new InvalidOperationException($"{providerName} TTS returned an unexpected WebSocket event.");
+ }
+ }
+
+ private static void EnsureWebSocketNotFailed(
+ string providerName,
+ VoiceTextToSpeechWebSocketContract contract,
+ JsonElement message)
+ {
+ if (TryGetJsonString(message, "event", out var eventName) &&
+ string.Equals(eventName, contract.TaskFailedEventName, StringComparison.OrdinalIgnoreCase))
+ {
+ var statusMessage = string.IsNullOrWhiteSpace(contract.ResponseStatusMessageJsonPath)
+ ? null
+ : TryGetJsonString(message, contract.ResponseStatusMessageJsonPath, out var value)
+ ? value
+ : null;
+
+ throw new InvalidOperationException(
+ string.IsNullOrWhiteSpace(statusMessage)
+ ? $"{providerName} TTS returned an error."
+ : $"{providerName} TTS returned an error: {statusMessage}");
+ }
+ }
+
private static JsonElement? GetJsonValue(JsonElement root, string? path)
{
if (string.IsNullOrWhiteSpace(path))
@@ -240,6 +394,31 @@ private static string GetRequiredJsonString(JsonElement root, string? path)
return text;
}
+ private static bool TryGetJsonString(JsonElement root, string? path, out string value)
+ {
+ value = string.Empty;
+ var found = GetJsonValue(root, path);
+ if (!found.HasValue)
+ {
+ return false;
+ }
+
+ var text = JsonElementToString(found.Value);
+ if (string.IsNullOrWhiteSpace(text))
+ {
+ return false;
+ }
+
+ value = text;
+ return true;
+ }
+
+ private static bool IsFinalWebSocketMessage(JsonElement root, VoiceTextToSpeechWebSocketContract contract)
+ {
+ var finalFlag = GetJsonValue(root, contract.FinalFlagJsonPath);
+ return finalFlag.HasValue && finalFlag.Value.ValueKind == JsonValueKind.True;
+ }
+
private static string? JsonElementToString(JsonElement element)
{
return element.ValueKind switch
@@ -297,6 +476,39 @@ private static async Task<VoiceCloudTextToSpeechResult> CreateResultAsync(Stream
return new VoiceCloudTextToSpeechResult(stream, string.IsNullOrWhiteSpace(contentType) ? "audio/mpeg" : contentType);
}
+ private static async Task SendTextMessageAsync(ClientWebSocket socket, string message)
+ {
+ var bytes = Encoding.UTF8.GetBytes(message);
+ await socket.SendAsync(bytes, WebSocketMessageType.Text, true, CancellationToken.None);
+ }
+
+ private static async Task<JsonElement> ReceiveJsonMessageAsync(ClientWebSocket socket)
+ {
+ using var buffer = new MemoryStream();
+ var receiveBuffer = new byte[8192];
+
+ while (true)
+ {
+ var segment = new ArraySegment<byte>(receiveBuffer);
+ var result = await socket.ReceiveAsync(segment, CancellationToken.None);
+
+ if (result.MessageType == WebSocketMessageType.Close)
+ {
+ throw new InvalidOperationException("Voice provider closed the WebSocket unexpectedly.");
+ }
+
+ buffer.Write(receiveBuffer, 0, result.Count);
+ if (result.EndOfMessage)
+ {
+ break;
+ }
+ }
+
+ var text = Encoding.UTF8.GetString(buffer.ToArray());
+ using var document = JsonDocument.Parse(text);
+ return document.RootElement.Clone();
+ }
+
private static HttpClient CreateHttpClient()
{
return new HttpClient
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
index 6fa68fe..275806e 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
@@ -79,7 +79,7 @@ public static bool SupportsTextToSpeechRuntime(string? providerId)
}
var provider = ResolveTextToSpeechProvider(providerId);
- return provider.TextToSpeechHttp != null;
+ return provider.TextToSpeechHttp != null || provider.TextToSpeechWebSocket != null;
}
private static VoiceProviderCatalog NormalizeCatalog(VoiceProviderCatalog catalog)
@@ -134,7 +134,8 @@ private static VoiceProviderOption Clone(VoiceProviderOption source)
Enabled = source.Enabled,
Description = source.Description,
Settings = source.Settings.Select(Clone).ToList(),
- TextToSpeechHttp = Clone(source.TextToSpeechHttp)
+ TextToSpeechHttp = Clone(source.TextToSpeechHttp),
+ TextToSpeechWebSocket = Clone(source.TextToSpeechWebSocket)
};
}
@@ -179,6 +180,35 @@ private static VoiceProviderSettingDefinition Clone(VoiceProviderSettingDefiniti
};
}
+ private static VoiceTextToSpeechWebSocketContract? Clone(VoiceTextToSpeechWebSocketContract? source)
+ {
+ if (source == null)
+ {
+ return null;
+ }
+
+ return new VoiceTextToSpeechWebSocketContract
+ {
+ EndpointTemplate = source.EndpointTemplate,
+ AuthenticationHeaderName = source.AuthenticationHeaderName,
+ AuthenticationScheme = source.AuthenticationScheme,
+ ApiKeySettingKey = source.ApiKeySettingKey,
+ ConnectSuccessEventName = source.ConnectSuccessEventName,
+ StartMessageTemplate = source.StartMessageTemplate,
+ StartSuccessEventName = source.StartSuccessEventName,
+ ContinueMessageTemplate = source.ContinueMessageTemplate,
+ FinishMessageTemplate = source.FinishMessageTemplate,
+ ResponseAudioMode = source.ResponseAudioMode,
+ ResponseAudioJsonPath = source.ResponseAudioJsonPath,
+ ResponseStatusCodeJsonPath = source.ResponseStatusCodeJsonPath,
+ ResponseStatusMessageJsonPath = source.ResponseStatusMessageJsonPath,
+ FinalFlagJsonPath = source.FinalFlagJsonPath,
+ TaskFailedEventName = source.TaskFailedEventName,
+ SuccessStatusValue = source.SuccessStatusValue,
+ OutputContentType = source.OutputContentType
+ };
+ }
+
private static string ResolveCatalogFilePath()
{
var bundledPath = Path.Combine(AppContext.BaseDirectory, CatalogRelativePath);
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index 3bef979..7e4c6f1 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -40,10 +40,12 @@ public void LoadCatalog_ExposesBuiltInCloudTtsContracts()
var minimax = Assert.Single(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.MiniMax);
Assert.Equal("MiniMax", minimax.Name);
- Assert.NotNull(minimax.TextToSpeechHttp);
- Assert.Equal("https://api-uw.minimax.io/v1/t2a_v2", minimax.TextToSpeechHttp!.EndpointTemplate);
- Assert.Equal("Authorization", minimax.TextToSpeechHttp.AuthenticationHeaderName);
- Assert.Equal(VoiceTextToSpeechResponseModes.HexJsonString, minimax.TextToSpeechHttp.ResponseAudioMode);
+ Assert.NotNull(minimax.TextToSpeechWebSocket);
+ Assert.Equal("wss://api.minimax.io/ws/v1/t2a_v2", minimax.TextToSpeechWebSocket!.EndpointTemplate);
+ Assert.Equal("Authorization", minimax.TextToSpeechWebSocket.AuthenticationHeaderName);
+ Assert.Equal(VoiceTextToSpeechResponseModes.HexJsonString, minimax.TextToSpeechWebSocket.ResponseAudioMode);
+ Assert.Contains("\"event\": \"task_start\"", minimax.TextToSpeechWebSocket.StartMessageTemplate);
+ Assert.Contains("\"event\": \"task_continue\"", minimax.TextToSpeechWebSocket.ContinueMessageTemplate);
var minimaxModelSetting = minimax.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Model);
Assert.Equal("speech-2.8-turbo", minimaxModelSetting.DefaultValue);
Assert.Contains("speech-2.8-turbo", minimaxModelSetting.Options);
From 45ff8f8c0a2b549e2aef5d3599c97a24cbb4c93f Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 22:57:02 +0000
Subject: [PATCH 26/83] Fix voice restart after settings save
---
src/OpenClaw.Tray.WinUI/App.xaml.cs | 65 +++++++++++++++----
.../Services/Voice/VoiceService.cs | 7 ++
2 files changed, 61 insertions(+), 11 deletions(-)
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index b6b909f..2b67272 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -1687,23 +1687,66 @@ private void ShowVoiceModeSettings()
_voiceModeWindow.Activate();
}
- private void OnSettingsSaved(object? sender, EventArgs e)
+ private async void OnSettingsSaved(object? sender, EventArgs e)
{
// Reconnect with new settings — mirror the startup if/else pattern
// to avoid dual connections that cause gateway conflicts.
- _gatewayClient?.Dispose();
- var oldNodeService = _nodeService;
- _nodeService = null;
- try { oldNodeService?.Dispose(); } catch (Exception ex) { Logger.Warn($"Node dispose error: {ex.Message}"); }
-
- if (_settings?.EnableNodeMode == true)
+ try
{
- InitializeNodeService();
+ if (_gatewayClient != null)
+ {
+ try
+ {
+ await _gatewayClient.DisconnectAsync();
+ }
+ catch (Exception ex)
+ {
+ Logger.Warn($"Gateway disconnect error: {ex.Message}");
+ }
+
+ _gatewayClient.Dispose();
+ _gatewayClient = null;
+ }
+
+ var oldNodeService = _nodeService;
+ _nodeService = null;
+ if (oldNodeService != null)
+ {
+ try
+ {
+ await oldNodeService.DisconnectAsync();
+ }
+ catch (Exception ex)
+ {
+ Logger.Warn($"Node disconnect error: {ex.Message}");
+ }
+
+ try
+ {
+ oldNodeService.Dispose();
+ }
+ catch (Exception ex)
+ {
+ Logger.Warn($"Node dispose error: {ex.Message}");
+ }
+ }
+
+ if (_settings?.EnableNodeMode == true)
+ {
+ InitializeNodeService();
+ }
+ else
+ {
+ InitializeGatewayClient();
+ if (_voiceService != null)
+ {
+ await _voiceService.StopAsync(new VoiceStopArgs { Reason = "Node mode disabled" });
+ }
+ }
}
- else
+ catch (Exception ex)
{
- InitializeGatewayClient();
- _ = _voiceService?.StopAsync(new VoiceStopArgs { Reason = "Node mode disabled" });
+ Logger.Warn($"Settings reconnect failed: {ex.Message}");
}
// Update global hotkey
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index a8ede20..8cade6c 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -483,6 +483,8 @@ private async Task<SpeechRecognizer> CreateSpeechRecognizerAsync(VoiceSettings s
throw new InvalidOperationException($"Speech recognizer unavailable: {compilation.Status}");
}
+ _logger.Info($"Speech recognizer compiled successfully ({compilation.Status})");
+
return recognizer;
}
@@ -583,6 +585,7 @@ private async Task StartRecognitionSessionAsync()
}
}
+ _logger.Info("Starting speech recognition session");
await recognizer.ContinuousRecognitionSession.StartAsync();
lock (_gate)
@@ -597,6 +600,8 @@ private async Task StartRecognitionSessionAsync()
null);
}
}
+
+ _logger.Info("Speech recognition session started");
}
private async Task StopRecognitionSessionAsync()
@@ -1065,6 +1070,8 @@ private async void OnSpeechRecognitionCompleted(
!_isSpeaking;
}
+ _logger.Warn($"Speech recognition session completed with status {args.Status}; restart={shouldRestart}");
+
if (shouldRestart && !token.IsCancellationRequested)
{
await Task.Delay(250, token);
From 71d0de4286bc3e1b13185145bf6b807c8fac0d9a Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 23:03:09 +0000
Subject: [PATCH 27/83] Fix MiniMax websocket voice playback routing
---
.../Services/Voice/VoiceService.cs | 7 ++++++-
.../VoiceServiceTransportTests.cs | 21 +++++++++++++++++++
2 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 8cade6c..289f33b 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -997,7 +997,7 @@ private async Task SpeakTextAsync(string text)
settings.TextToSpeechProviderId,
_logger);
- if (provider.TextToSpeechHttp != null)
+ if (UsesCloudTextToSpeechRuntime(provider))
{
using var result = await _cloudTextToSpeechClient.SynthesizeAsync(text, provider, providerConfiguration, _logger);
await PlayStreamAsync(player, result.Stream, result.ContentType);
@@ -1015,6 +1015,11 @@ private async Task SpeakTextAsync(string text)
await PlayStreamAsync(player, stream, stream.ContentType);
}
+ private static bool UsesCloudTextToSpeechRuntime(VoiceProviderOption provider)
+ {
+ return provider.TextToSpeechHttp != null || provider.TextToSpeechWebSocket != null;
+ }
+
private static async Task PlayStreamAsync(
MediaPlayer player,
IRandomAccessStream stream,
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index 8a3e43c..2686278 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -45,6 +45,27 @@ public void GetOrCreateTransportReadySource_CreatesFreshTaskAfterError()
Assert.True((bool)arguments[2]!);
}
+ [Fact]
+ public void UsesCloudTextToSpeechRuntime_ReturnsTrueForWebSocketProviders()
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "UsesCloudTextToSpeechRuntime",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var provider = new VoiceProviderOption
+ {
+ Id = VoiceProviderIds.MiniMax,
+ TextToSpeechWebSocket = new VoiceTextToSpeechWebSocketContract
+ {
+ EndpointTemplate = "wss://example.test/tts"
+ }
+ };
+
+ var result = (bool)method.Invoke(null, [provider])!;
+
+ Assert.True(result);
+ }
+
private static MethodInfo GetMethod()
{
return typeof(VoiceService).GetMethod(
From 91ccec377f19b26c7b2870facedc5a85f31a7d7e Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 23:28:10 +0000
Subject: [PATCH 28/83] Add dynamic tray icons for voice states
---
src/OpenClaw.Tray.WinUI/App.xaml.cs | 73 ++++++++-
src/OpenClaw.Tray.WinUI/Helpers/IconHelper.cs | 154 +++++++++++++++++-
.../Services/Voice/VoiceService.cs | 8 +
.../VoiceProviderCatalogServiceTests.cs | 19 +++
4 files changed, 245 insertions(+), 9 deletions(-)
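Reviewer note (illustrative only, not part of the patch): the tray icon is
now refreshed by a 250 ms timer, so the patch remembers the last icon path
and only calls SetIcon when the path actually changes. The sketch below
shows that idea together with a simplified state mapping. RuntimeState,
IconState, TraySketch and IconUpdater are hypothetical names that mirror,
but are not, the types in this series, and the activation-mode check from
the real mapping is omitted for brevity.

```csharp
using System;

enum RuntimeState { Idle, RecordingUtterance, PlayingResponse, Paused }
enum IconState { Off, Armed, Listening, Speaking }

static class TraySketch
{
    // Maps the voice runtime state to the overlay that should be drawn.
    public static IconState Map(bool running, RuntimeState state)
    {
        if (!running) return IconState.Off;

        return state switch
        {
            RuntimeState.PlayingResponse => IconState.Speaking,
            RuntimeState.RecordingUtterance => IconState.Listening,
            RuntimeState.Paused => IconState.Off,
            _ => IconState.Armed
        };
    }

    public static void Main()
    {
        var applied = 0;
        var updater = new IconUpdater(_ => applied++);

        // Four timer ticks, but only two distinct icon paths.
        updater.Update("voice-armed.ico");
        updater.Update("voice-armed.ico");
        updater.Update("VOICE-ARMED.ICO");
        updater.Update("voice-listening.ico");

        Console.WriteLine(applied); // prints 2
    }
}

// Applies an icon only when the path differs from the last one applied.
sealed class IconUpdater
{
    private readonly Action<string> _setIcon;
    private string _lastPath = "";

    public IconUpdater(Action<string> setIcon) => _setIcon = setIcon;

    public bool Update(string path)
    {
        if (string.Equals(_lastPath, path, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        _setIcon(path);
        _lastPath = path;
        return true;
    }
}
```

The comparison ignores case because Windows file paths are normally
case-insensitive, which matches the comparison used in UpdateTrayIcon.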
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index 2b67272..8ecb2ff 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -38,6 +38,7 @@ public partial class App : Application
private GlobalHotkeyService? _globalHotkey;
private System.Timers.Timer? _healthCheckTimer;
private System.Timers.Timer? _sessionPollTimer;
+ private Microsoft.UI.Dispatching.DispatcherQueueTimer? _voiceTrayIconTimer;
private Mutex? _mutex;
private Microsoft.UI.Dispatching.DispatcherQueue? _dispatcherQueue;
private CancellationTokenSource? _deepLinkCts;
@@ -55,6 +56,7 @@ public partial class App : Application
private GatewayCostUsageInfo? _lastUsageCost;
private DateTime _lastCheckTime = DateTime.Now;
private DateTime _lastUsageActivityLogUtc = DateTime.MinValue;
+ private string? _lastTrayIconPath;
// Session-aware activity tracking
private readonly Dictionary<string, AgentActivity> _sessionActivities = new();
@@ -286,6 +288,7 @@ protected override async void OnLaunched(LaunchActivatedEventArgs args)
// Start health check timer
StartHealthCheckTimer();
+ StartVoiceTrayIconTimer();
// Start deep link server
StartDeepLinkServer();
@@ -333,11 +336,26 @@ private void InitializeTrayIcon()
var iconPath = IconHelper.GetStatusIconPath(ConnectionStatus.Disconnected);
_trayIcon = new TrayIcon(1, iconPath, "OpenClaw Tray — Disconnected");
+ _lastTrayIconPath = iconPath;
_trayIcon.IsVisible = true;
_trayIcon.Selected += OnTrayIconSelected;
_trayIcon.ContextMenu += OnTrayContextMenu;
}
+ private void StartVoiceTrayIconTimer()
+ {
+ if (_dispatcherQueue == null || _voiceTrayIconTimer != null)
+ {
+ return;
+ }
+
+ _voiceTrayIconTimer = _dispatcherQueue.CreateTimer();
+ _voiceTrayIconTimer.Interval = TimeSpan.FromMilliseconds(250);
+ _voiceTrayIconTimer.IsRepeating = true;
+ _voiceTrayIconTimer.Tick += (s, e) => UpdateTrayIcon();
+ _voiceTrayIconTimer.Start();
+ }
+
private void InitializeTrayMenuWindow()
{
// Pre-create menu window once - reuse to avoid crash on window creation after idle
@@ -1622,13 +1640,7 @@ private void UpdateTrayIcon()
{
if (_trayIcon == null) return;
- var status = _currentStatus;
- if (_currentActivity != null && _currentActivity.Kind != OpenClaw.Shared.ActivityKind.Idle)
- {
- status = ConnectionStatus.Connecting; // Use connecting icon for activity
- }
-
- var iconPath = IconHelper.GetStatusIconPath(status);
+ var iconPath = GetTrayIconPathForCurrentState();
var tooltip = $"OpenClaw Tray — {_currentStatus}";
if (_currentActivity != null && !string.IsNullOrEmpty(_currentActivity.DisplayText))
@@ -1640,7 +1652,11 @@ private void UpdateTrayIcon()
try
{
- _trayIcon.SetIcon(iconPath);
+ if (!string.Equals(_lastTrayIconPath, iconPath, StringComparison.OrdinalIgnoreCase))
+ {
+ _trayIcon.SetIcon(iconPath);
+ _lastTrayIconPath = iconPath;
+ }
_trayIcon.Tooltip = tooltip;
}
catch (Exception ex)
@@ -1649,6 +1665,46 @@ private void UpdateTrayIcon()
}
}
+ private string GetTrayIconPathForCurrentState()
+ {
+ var voiceIconState = GetVoiceTrayIconState();
+ if (voiceIconState != VoiceTrayIconState.Off)
+ {
+ return IconHelper.GetVoiceTrayIconPath(voiceIconState);
+ }
+
+ if (_voiceService?.CurrentStatus.State == VoiceRuntimeState.Paused)
+ {
+ return IconHelper.GetVoiceTrayIconPath(VoiceTrayIconState.Off);
+ }
+
+ var status = _currentStatus;
+ if (_currentActivity != null && _currentActivity.Kind != OpenClaw.Shared.ActivityKind.Idle)
+ {
+ status = ConnectionStatus.Connecting;
+ }
+
+ return IconHelper.GetStatusIconPath(status);
+ }
+
+ private VoiceTrayIconState GetVoiceTrayIconState()
+ {
+ var voiceStatus = _voiceService?.CurrentStatus;
+ if (voiceStatus == null || !voiceStatus.Running)
+ {
+ return VoiceTrayIconState.Off;
+ }
+
+ return voiceStatus.State switch
+ {
+ VoiceRuntimeState.PlayingResponse => VoiceTrayIconState.Speaking,
+ VoiceRuntimeState.RecordingUtterance => VoiceTrayIconState.Listening,
+ VoiceRuntimeState.Paused => VoiceTrayIconState.Off,
+ _ when voiceStatus.Mode == VoiceActivationMode.Off => VoiceTrayIconState.Off,
+ _ => VoiceTrayIconState.Armed
+ };
+ }
+
#endregion
#region Window Management
@@ -2323,6 +2379,7 @@ private void ExitApplication()
_healthCheckTimer?.Dispose();
_sessionPollTimer?.Stop();
_sessionPollTimer?.Dispose();
+ _voiceTrayIconTimer?.Stop();
// Cleanup hotkey
_globalHotkey?.Dispose();
diff --git a/src/OpenClaw.Tray.WinUI/Helpers/IconHelper.cs b/src/OpenClaw.Tray.WinUI/Helpers/IconHelper.cs
index d2181cd..71a28ab 100644
--- a/src/OpenClaw.Tray.WinUI/Helpers/IconHelper.cs
+++ b/src/OpenClaw.Tray.WinUI/Helpers/IconHelper.cs
@@ -6,6 +6,14 @@
namespace OpenClawTray.Helpers;
+public enum VoiceTrayIconState
+{
+ Off,
+ Armed,
+ Listening,
+ Speaking
+}
+
/// <summary>
/// Provides icon resources for the tray application.
/// Creates dynamic status icons with lobster pixel art.
@@ -14,6 +22,10 @@ public static class IconHelper
{
private static readonly string AssetsPath = Path.Combine(AppContext.BaseDirectory, "Assets");
private static readonly string IconsPath = Path.Combine(AssetsPath, "Icons");
+ private static readonly string GeneratedIconsPath = Path.Combine(
+ Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
+ "OpenClawTray",
+ "GeneratedIcons");
// Icon cache
private static Icon? _connectedIcon;
@@ -21,6 +33,9 @@ public static class IconHelper
private static Icon? _activityIcon;
private static Icon? _errorIcon;
private static Icon? _appIcon;
+ private static string? _voiceArmedIconPath;
+ private static string? _voiceListeningIconPath;
+ private static string? _voiceSpeakingIconPath;
public static string GetStatusIconPath(ConnectionStatus status)
{
@@ -43,6 +58,28 @@ public static string GetStatusIconPath(ConnectionStatus status)
return path;
}
+ public static string GetAppIconPath()
+ {
+ var path = Path.Combine(AssetsPath, "openclaw.ico");
+ if (File.Exists(path))
+ {
+ return path;
+ }
+
+ return GetStatusIconPath(ConnectionStatus.Disconnected);
+ }
+
+ public static string GetVoiceTrayIconPath(VoiceTrayIconState state)
+ {
+ return state switch
+ {
+ VoiceTrayIconState.Armed => GetOrCreateVoiceIconPath(ref _voiceArmedIconPath, VoiceTrayIconState.Armed),
+ VoiceTrayIconState.Listening => GetOrCreateVoiceIconPath(ref _voiceListeningIconPath, VoiceTrayIconState.Listening),
+ VoiceTrayIconState.Speaking => GetOrCreateVoiceIconPath(ref _voiceSpeakingIconPath, VoiceTrayIconState.Speaking),
+ _ => GetAppIconPath()
+ };
+ }
+
public static Icon GetStatusIcon(ConnectionStatus status)
{
return status switch
@@ -58,7 +95,7 @@ public static Icon GetAppIcon()
{
if (_appIcon != null) return _appIcon;
- var iconPath = Path.Combine(AssetsPath, "openclaw.ico");
+ var iconPath = GetAppIconPath();
if (File.Exists(iconPath))
{
_appIcon = new Icon(iconPath);
@@ -140,6 +177,121 @@ public static Icon CreateLobsterIcon(Color color)
return result;
}
+ private static string GetOrCreateVoiceIconPath(ref string? cachedPath, VoiceTrayIconState state)
+ {
+ if (!string.IsNullOrWhiteSpace(cachedPath) && File.Exists(cachedPath))
+ {
+ return cachedPath;
+ }
+
+ Directory.CreateDirectory(GeneratedIconsPath);
+ var outputPath = Path.Combine(GeneratedIconsPath, $"voice-{state.ToString().ToLowerInvariant()}.ico");
+
+ using var bitmap = CreateVoiceTrayBitmap(state);
+ using var icon = CreateIcon(bitmap);
+ using var stream = File.Create(outputPath);
+ icon.Save(stream);
+
+ cachedPath = outputPath;
+ return outputPath;
+ }
+
+ private static Bitmap CreateVoiceTrayBitmap(VoiceTrayIconState state)
+ {
+ const int size = 32;
+ var bitmap = new Bitmap(size, size);
+ using var graphics = Graphics.FromImage(bitmap);
+
+ graphics.Clear(Color.Transparent);
+ graphics.SmoothingMode = System.Drawing.Drawing2D.SmoothingMode.AntiAlias;
+ graphics.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
+
+ using (var baseIcon = new Icon(GetAppIconPath(), size, size))
+ using (var baseBitmap = baseIcon.ToBitmap())
+ {
+ graphics.DrawImage(baseBitmap, 0, 0, size, size);
+ }
+
+ switch (state)
+ {
+ case VoiceTrayIconState.Armed:
+ DrawHeadphones(graphics);
+ break;
+ case VoiceTrayIconState.Listening:
+ DrawHeadphones(graphics);
+ DrawMicrophone(graphics);
+ break;
+ case VoiceTrayIconState.Speaking:
+ DrawHeadphones(graphics);
+ DrawSpeaker(graphics);
+ break;
+ }
+
+ return bitmap;
+ }
+
+ private static void DrawHeadphones(Graphics graphics)
+ {
+ using var shadowPen = new Pen(Color.FromArgb(96, 255, 255, 255), 4f);
+ using var bandPen = new Pen(Color.FromArgb(42, 48, 58), 3f);
+ using var earBrush = new SolidBrush(Color.FromArgb(42, 48, 58));
+
+ graphics.DrawArc(shadowPen, 6, 3, 20, 16, 180, 180);
+ graphics.DrawArc(bandPen, 6, 3, 20, 16, 180, 180);
+ using (var leftEar = CreateRoundedRectanglePath(4, 12, 5, 10, 3)) graphics.FillPath(earBrush, leftEar);
+ using (var rightEar = CreateRoundedRectanglePath(23, 12, 5, 10, 3)) graphics.FillPath(earBrush, rightEar);
+ }
+
+ private static void DrawMicrophone(Graphics graphics)
+ {
+ using var brush = new SolidBrush(Color.FromArgb(33, 150, 243));
+ using var pen = new Pen(Color.FromArgb(33, 150, 243), 2f);
+
+ using (var capsule = CreateRoundedRectanglePath(22, 17, 6, 9, 3)) graphics.FillPath(brush, capsule);
+ graphics.FillRectangle(brush, 24, 25, 2, 4);
+ graphics.DrawArc(pen, 21, 27, 8, 5, 0, 180);
+ graphics.DrawLine(pen, 20, 21, 15, 19);
+ }
+
+ private static void DrawSpeaker(Graphics graphics)
+ {
+ using var brush = new SolidBrush(Color.FromArgb(76, 175, 80));
+ using var pen = new Pen(Color.FromArgb(76, 175, 80), 2f);
+ using var thinPen = new Pen(Color.FromArgb(76, 175, 80), 1.5f);
+
+ var points = new[]
+ {
+ new Point(24, 17),
+ new Point(19, 20),
+ new Point(19, 24),
+ new Point(24, 27)
+ };
+
+ graphics.FillPolygon(brush, points);
+ graphics.DrawArc(pen, 22, 17, 6, 10, 300, 120);
+ graphics.DrawArc(thinPen, 21, 14, 10, 16, 300, 120);
+ }
+
+ private static Icon CreateIcon(Bitmap bitmap)
+ {
+ var handle = bitmap.GetHicon();
+ using var icon = Icon.FromHandle(handle);
+ var result = (Icon)icon.Clone();
+ DestroyIcon(handle);
+ return result;
+ }
+
+ private static System.Drawing.Drawing2D.GraphicsPath CreateRoundedRectanglePath(int x, int y, int width, int height, int radius)
+ {
+ var path = new System.Drawing.Drawing2D.GraphicsPath();
+ path.AddArc(x, y, radius, radius, 180, 90);
+ path.AddArc(x + width - radius, y, radius, radius, 270, 90);
+ path.AddArc(x + width - radius, y + height - radius, radius, radius, 0, 90);
+ path.AddArc(x, y + height - radius, radius, radius, 90, 90);
+ path.CloseFigure();
+ return path;
+ }
+
[DllImport("user32.dll", CharSet = CharSet.Auto)]
private static extern bool DestroyIcon(IntPtr handle);
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 289f33b..26a1d5e 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -684,6 +684,14 @@ private void OnSpeechHypothesisGenerated(SpeechRecognizer sender, SpeechRecognit
text = args.Hypothesis?.Text?.Trim();
sessionKey = GetCurrentVoiceSessionKey();
+ if (_status.State != VoiceRuntimeState.RecordingUtterance)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.RecordingUtterance,
+ _status.LastError);
+ }
}
if (string.IsNullOrWhiteSpace(text))
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index 7e4c6f1..192ac01 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -1,6 +1,7 @@
using System;
using System.IO;
using OpenClaw.Shared;
+using OpenClawTray.Helpers;
using OpenClawTray.Services.Voice;
using System.Linq;
@@ -8,6 +9,24 @@ namespace OpenClaw.Tray.Tests;
public class VoiceProviderCatalogServiceTests
{
+ [Fact]
+ public void GetVoiceTrayIconPath_ReturnsBundledAppIconForOff()
+ {
+ var path = IconHelper.GetVoiceTrayIconPath(VoiceTrayIconState.Off);
+
+ Assert.Equal(IconHelper.GetAppIconPath(), path, ignoreCase: true);
+ }
+
+ [Fact]
+ public void GetVoiceTrayIconPath_GeneratesListeningVariant()
+ {
+ var path = IconHelper.GetVoiceTrayIconPath(VoiceTrayIconState.Listening);
+
+ Assert.True(File.Exists(path));
+ Assert.EndsWith(".ico", path, StringComparison.OrdinalIgnoreCase);
+ Assert.NotEqual(IconHelper.GetAppIconPath(), path, StringComparer.OrdinalIgnoreCase);
+ }
+
[Fact]
public void CatalogFilePath_ResolvesToExistingBundledAsset()
{
From 2ff57fc0175a9b141a48aa333d47b4306b5d1c05 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 23:41:39 +0000
Subject: [PATCH 29/83] Add pre-response voice latency timing logs
---
.../Services/Voice/VoiceService.cs | 19 +++++++++++++++++++
.../Windows/WebChatWindow.xaml.cs | 5 ++++-
2 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 26a1d5e..07a050a 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -706,6 +706,11 @@ private async Task HandleRecognizedTextAsync(string text)
{
CancellationToken cancellationToken;
string sessionKey;
+ var pipelineStopwatch = Stopwatch.StartNew();
+ long recognitionStopElapsedMs = 0;
+ long transportReadyElapsedMs = 0;
+ long traySubmitElapsedMs = 0;
+ long directSendElapsedMs = 0;
lock (_gate)
{
@@ -735,10 +740,12 @@ private async Task HandleRecognizedTextAsync(string text)
RaiseTranscriptDraft(text, sessionKey, clear: false);
await StopRecognitionSessionAsync();
+ recognitionStopElapsedMs = pipelineStopwatch.ElapsedMilliseconds;
try
{
await EnsureChatTransportAsync(cancellationToken);
+ transportReadyElapsedMs = pipelineStopwatch.ElapsedMilliseconds - recognitionStopElapsedMs;
OpenClawGatewayClient? client;
Func<string, string?, Task<VoiceTranscriptSubmitOutcome>>? transcriptSubmitter;
@@ -757,17 +764,25 @@ private async Task HandleRecognizedTextAsync(string text)
var submitOutcome = VoiceTranscriptSubmitOutcome.Unavailable;
if (transcriptSubmitter != null)
{
+ var submitStopwatch = Stopwatch.StartNew();
submitOutcome = await transcriptSubmitter(text, sessionKey);
+ traySubmitElapsedMs = submitStopwatch.ElapsedMilliseconds;
+ _logger.Info($"Voice tray submit path: outcome={submitOutcome} elapsed={traySubmitElapsedMs}ms");
}
if (submitOutcome == VoiceTranscriptSubmitOutcome.Unavailable)
{
+ var directSendStopwatch = Stopwatch.StartNew();
await client.SendChatMessageAsync(text, sessionKey);
+ directSendElapsedMs = directSendStopwatch.ElapsedMilliseconds;
submitOutcome = VoiceTranscriptSubmitOutcome.Submitted;
+ _logger.Info($"Voice direct send path: elapsed={directSendElapsedMs}ms");
}
if (submitOutcome == VoiceTranscriptSubmitOutcome.DeferredToUser)
{
+ _logger.Info(
+ $"Voice pre-response latency: recognitionStop={recognitionStopElapsedMs}ms transportReady={transportReadyElapsedMs}ms traySubmit={traySubmitElapsedMs}ms total={pipelineStopwatch.ElapsedMilliseconds}ms (deferred to user)");
lock (_gate)
{
_awaitingReply = false;
@@ -784,6 +799,8 @@ private async Task HandleRecognizedTextAsync(string text)
return;
}
+ _logger.Info(
+ $"Voice pre-response latency: recognitionStop={recognitionStopElapsedMs}ms transportReady={transportReadyElapsedMs}ms traySubmit={traySubmitElapsedMs}ms directSend={directSendElapsedMs}ms total={pipelineStopwatch.ElapsedMilliseconds}ms");
lock (_gate)
{
_awaitingReply = true;
@@ -796,6 +813,7 @@ private async Task HandleRecognizedTextAsync(string text)
_status.LastUtteranceUtc = DateTime.UtcNow;
}
+ _logger.Info("Voice response wait started");
RaiseConversationTurn(VoiceConversationDirection.Outgoing, text, sessionKey);
RaiseTranscriptDraft(string.Empty, sessionKey, clear: true);
_ = MonitorReplyTimeoutAsync(text, cancellationToken);
@@ -842,6 +860,7 @@ public void NotifyManualTranscriptSubmitted(string text, string? sessionKey = nu
_status.LastUtteranceUtc = DateTime.UtcNow;
}
+ _logger.Info("Voice response wait started (manual submit)");
RaiseConversationTurn(VoiceConversationDirection.Outgoing, text, effectiveSessionKey);
RaiseTranscriptDraft(string.Empty, effectiveSessionKey, clear: true);
_ = MonitorReplyTimeoutAsync(text, cancellationToken);
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index 1bd1acd..e772d62 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -569,10 +569,13 @@ public async Task<bool> TrySubmitVoiceTranscriptAsync(string text)
try
{
+ var stopwatch = Stopwatch.StartNew();
var textJson = JsonSerializer.Serialize(text ?? string.Empty);
var result = await WebView.CoreWebView2.ExecuteScriptAsync(
$"window.__openClawTrayVoice?.submitDraft?.({textJson}) ?? false;");
- return string.Equals(result, "true", StringComparison.OrdinalIgnoreCase);
+ var submitted = string.Equals(result, "true", StringComparison.OrdinalIgnoreCase);
+ Logger.Info($"WebChatWindow: Voice draft submit via chat UI {(submitted ? "succeeded" : "failed")} in {stopwatch.ElapsedMilliseconds}ms");
+ return submitted;
}
catch (Exception ex)
{
From ffa3fa234fb3420034d99789d201755fa6692af8 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 23 Mar 2026 23:45:16 +0000
Subject: [PATCH 30/83] Keep talk mode alive after input failures
---
.../Services/Voice/VoiceService.cs | 102 ++++++++++++++++--
1 file changed, 93 insertions(+), 9 deletions(-)
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 07a050a..93f6692 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -29,6 +29,7 @@ public sealed class VoiceService : IVoiceRuntime, IDisposable
private static readonly TimeSpan TransportConnectTimeout = TimeSpan.FromSeconds(10);
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
+ private static readonly TimeSpan RecognitionResumeRetryDelay = TimeSpan.FromMilliseconds(500);
private readonly IOpenClawLogger _logger;
private readonly SettingsManager _settings;
@@ -604,6 +605,61 @@ private async Task StartRecognitionSessionAsync()
_logger.Info("Speech recognition session started");
}
+ private async Task ResumeRecognitionSessionAsync(
+ CancellationToken cancellationToken,
+ string reason,
+ string? lastError = null)
+ {
+ const int maxAttempts = 2;
+ string? currentError = lastError;
+
+ for (var attempt = 1; attempt <= maxAttempts; attempt++)
+ {
+ cancellationToken.ThrowIfCancellationRequested();
+
+ try
+ {
+ await StartRecognitionSessionAsync();
+ return;
+ }
+ catch (OperationCanceledException)
+ {
+ throw;
+ }
+ catch (Exception ex)
+ {
+ currentError = GetUserFacingErrorMessage(ex);
+ _logger.Warn(
+ $"Voice recognition resume failed ({reason}, attempt {attempt}/{maxAttempts}): {ex.Message}");
+
+ lock (_gate)
+ {
+ if (_runtimeCts == null ||
+ !_status.Running ||
+ _status.Mode != VoiceActivationMode.TalkMode ||
+ _awaitingReply ||
+ _isSpeaking)
+ {
+ return;
+ }
+
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.Arming,
+ currentError);
+ }
+
+ if (attempt == maxAttempts)
+ {
+ return;
+ }
+
+ await Task.Delay(RecognitionResumeRetryDelay, cancellationToken);
+ }
+ }
+ }
+
private async Task StopRecognitionSessionAsync()
{
SpeechRecognizer? recognizer;
@@ -656,11 +712,38 @@ private async void OnSpeechResultGenerated(
catch (Exception ex)
{
_logger.Error("Voice recognition handler failed", ex);
+ CancellationToken cancellationToken;
+ var shouldResume = false;
+ var userMessage = GetUserFacingErrorMessage(ex);
lock (_gate)
{
- if (_status.Running)
+ if (_runtimeCts != null &&
+ _status.Running &&
+ _status.Mode == VoiceActivationMode.TalkMode)
+ {
+ cancellationToken = _runtimeCts.Token;
+ _awaitingReply = false;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.Arming,
+ userMessage);
+ shouldResume = true;
+ }
+ else
+ {
+ return;
+ }
+ }
+
+ if (shouldResume)
+ {
+ try
+ {
+ await ResumeRecognitionSessionAsync(cancellationToken, "result handler failure", userMessage);
+ }
+ catch (OperationCanceledException)
{
- _status = BuildErrorStatus(VoiceActivationMode.TalkMode, _status.SessionKey, GetUserFacingErrorMessage(ex));
}
}
}
@@ -821,6 +904,7 @@ private async Task HandleRecognizedTextAsync(string text)
catch (Exception ex)
{
_logger.Error("Voice transcript submit failed", ex);
+ var userMessage = GetUserFacingErrorMessage(ex);
lock (_gate)
{
@@ -828,11 +912,11 @@ private async Task HandleRecognizedTextAsync(string text)
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
- GetUserFacingErrorMessage(ex));
+ VoiceRuntimeState.Arming,
+ userMessage);
}
- await StartRecognitionSessionAsync();
+ await ResumeRecognitionSessionAsync(cancellationToken, "transcript submit failure", userMessage);
}
}
@@ -890,7 +974,7 @@ private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken
if (shouldResume)
{
- await StartRecognitionSessionAsync();
+ await ResumeRecognitionSessionAsync(cancellationToken, "reply timeout");
}
}
catch (OperationCanceledException)
@@ -948,7 +1032,7 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
}
}
- await StartRecognitionSessionAsync();
+ await ResumeRecognitionSessionAsync(CancellationToken.None, "empty assistant reply");
return;
}
@@ -986,7 +1070,7 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
try
{
- await StartRecognitionSessionAsync();
+ await ResumeRecognitionSessionAsync(CancellationToken.None, "assistant reply playback completed");
}
catch (Exception ex)
{
@@ -1107,7 +1191,7 @@ private async void OnSpeechRecognitionCompleted(
if (shouldRestart && !token.IsCancellationRequested)
{
await Task.Delay(250, token);
- await StartRecognitionSessionAsync();
+ await ResumeRecognitionSessionAsync(token, $"recognition completed ({args.Status})");
}
}
catch (OperationCanceledException)
From c3ded30d479f25f6f52cf66e9a8e011a093949f4 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 00:10:16 +0000
Subject: [PATCH 31/83] Queue talk mode replies for sequential playback
---
docs/VOICE-MODE.md | 1 +
.../Services/Voice/VoiceService.cs | 144 ++++++++++++++----
.../VoiceServiceTransportTests.cs | 19 +++
3 files changed, 132 insertions(+), 32 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 172a5c0..4b65815 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -55,6 +55,7 @@ The contracts and persisted settings now use `VoiceWake` and `TalkMode` as well.
- otherwise, the transcript is sent to OpenClaw via direct `chat.send` on the main session
- OpenClaw returns the assistant reply as normal chat output
- the node performs local TTS playback of that reply
+- assistant replies are queued locally and spoken sequentially so overlapping responses are not lost, with a 500 ms pause between queued replies
To avoid obvious duplicate sends from the Windows recognizer, exact duplicate final transcripts are suppressed within a short 750 ms window.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 93f6692..768a68c 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -30,6 +30,7 @@ public sealed class VoiceService : IVoiceRuntime, IDisposable
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
private static readonly TimeSpan RecognitionResumeRetryDelay = TimeSpan.FromMilliseconds(500);
+ private static readonly TimeSpan QueuedReplyPlaybackGap = TimeSpan.FromMilliseconds(500);
private readonly IOpenClawLogger _logger;
private readonly SettingsManager _settings;
@@ -48,10 +49,12 @@ public sealed class VoiceService : IVoiceRuntime, IDisposable
private bool _recognitionActive;
private bool _awaitingReply;
private bool _isSpeaking;
+ private bool _replyPlaybackLoopActive;
private bool _quickPaused;
private string? _lastTranscript;
private DateTime _lastTranscriptUtc;
private string? _pendingManualTranscript;
+ private readonly Queue<(string Text, string? SessionKey)> _pendingAssistantReplies = new();
private bool _disposed;
public event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
@@ -994,10 +997,11 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
}
string text;
+ bool shouldStartPlaybackLoop = false;
lock (_gate)
{
- if (!_awaitingReply || !_status.Running || _status.Mode != VoiceActivationMode.TalkMode)
+ if (!_status.Running || _status.Mode != VoiceActivationMode.TalkMode)
{
return;
}
@@ -1007,80 +1011,149 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
return;
}
+ if (!ShouldAcceptAssistantReply(_awaitingReply, _isSpeaking, _pendingAssistantReplies.Count))
+ {
+ return;
+ }
+
_awaitingReply = false;
- _isSpeaking = true;
- _status = BuildRunningStatus(
- VoiceActivationMode.TalkMode,
- _status.SessionKey,
- VoiceRuntimeState.PlayingResponse,
- _status.LastError);
text = PrepareReplyForSpeech(args.Message);
}
if (string.IsNullOrWhiteSpace(text))
{
+ var shouldResumeRecognition = false;
lock (_gate)
{
- _isSpeaking = false;
- if (_status.Running)
+ if (_status.Running && !_replyPlaybackLoopActive)
{
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
VoiceRuntimeState.ListeningContinuously,
_status.LastError);
+ shouldResumeRecognition = true;
}
}
- await ResumeRecognitionSessionAsync(CancellationToken.None, "empty assistant reply");
+ if (shouldResumeRecognition)
+ {
+ await ResumeRecognitionSessionAsync(CancellationToken.None, "empty assistant reply");
+ }
return;
}
- try
- {
- RaiseConversationTurn(VoiceConversationDirection.Incoming, text, args.SessionKey);
- await SpeakTextAsync(text);
- }
- catch (Exception ex)
+ RaiseConversationTurn(VoiceConversationDirection.Incoming, text, args.SessionKey);
+
+ lock (_gate)
{
- _logger.Error("Voice reply playback failed", ex);
- lock (_gate)
+ _pendingAssistantReplies.Enqueue((text, args.SessionKey));
+ _logger.Info($"Voice reply queued: pending={_pendingAssistantReplies.Count}");
+
+ if (!_replyPlaybackLoopActive)
{
+ _replyPlaybackLoopActive = true;
+ _isSpeaking = true;
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
- GetUserFacingErrorMessage(ex));
+ VoiceRuntimeState.PlayingResponse,
+ _status.LastError);
+ shouldStartPlaybackLoop = true;
}
}
- finally
+
+ if (shouldStartPlaybackLoop)
+ {
+ _ = ProcessQueuedAssistantRepliesAsync();
+ }
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice chat message handler failed: {ex.Message}");
+ }
+ }
+
+ private async Task ProcessQueuedAssistantRepliesAsync()
+ {
+ try
+ {
+ while (true)
{
+ (string Text, string? SessionKey) reply;
+ var shouldPauseBeforeNextReply = false;
+
lock (_gate)
{
- _isSpeaking = false;
- if (_status.Running)
+ if (_pendingAssistantReplies.Count == 0)
{
- _status = BuildRunningStatus(
- VoiceActivationMode.TalkMode,
- _status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
- _status.LastError);
+ _replyPlaybackLoopActive = false;
+ _isSpeaking = false;
+
+ if (_status.Running)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ _status.LastError);
+ }
+
+ break;
}
+
+ reply = _pendingAssistantReplies.Dequeue();
+ shouldPauseBeforeNextReply = _pendingAssistantReplies.Count > 0;
}
try
{
- await ResumeRecognitionSessionAsync(CancellationToken.None, "assistant reply playback completed");
+ await SpeakTextAsync(reply.Text);
}
catch (Exception ex)
{
- _logger.Warn($"Voice recognition resume failed: {ex.Message}");
+ _logger.Error("Voice reply playback failed", ex);
+ lock (_gate)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ shouldPauseBeforeNextReply ? VoiceRuntimeState.PlayingResponse : VoiceRuntimeState.ListeningContinuously,
+ GetUserFacingErrorMessage(ex));
+ }
+ }
+
+ if (shouldPauseBeforeNextReply)
+ {
+ _logger.Info($"Voice reply playback paused before next queued response ({QueuedReplyPlaybackGap.TotalMilliseconds}ms)");
+ await Task.Delay(QueuedReplyPlaybackGap);
}
}
}
- catch (Exception ex)
+ finally
{
- _logger.Warn($"Voice chat message handler failed: {ex.Message}");
+ lock (_gate)
+ {
+ _replyPlaybackLoopActive = false;
+ _isSpeaking = false;
+ if (_status.Running)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ _status.LastError);
+ }
+ }
+
+ try
+ {
+ await ResumeRecognitionSessionAsync(CancellationToken.None, "queued assistant reply playback completed");
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice recognition resume failed: {ex.Message}");
+ }
}
}
@@ -1131,6 +1204,11 @@ private static bool UsesCloudTextToSpeechRuntime(VoiceProviderOption provider)
return provider.TextToSpeechHttp != null || provider.TextToSpeechWebSocket != null;
}
+ internal static bool ShouldAcceptAssistantReply(bool awaitingReply, bool isSpeaking, int queuedReplyCount)
+ {
+ return awaitingReply || isSpeaking || queuedReplyCount > 0;
+ }
+
private static async Task PlayStreamAsync(
MediaPlayer player,
IRandomAccessStream stream,
@@ -1284,7 +1362,9 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_awaitingReply = false;
_isSpeaking = false;
+ _replyPlaybackLoopActive = false;
_pendingManualTranscript = null;
+ _pendingAssistantReplies.Clear();
}
try { runtimeCts?.Cancel(); } catch { }
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index 2686278..fdd5c75 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -66,6 +66,25 @@ public void UsesCloudTextToSpeechRuntime_ReturnsTrueForWebSocketProviders()
Assert.True(result);
}
+ [Theory]
+ [InlineData(true, false, 0, true)]
+ [InlineData(false, true, 0, true)]
+ [InlineData(false, false, 1, true)]
+ [InlineData(false, false, 0, false)]
+ public void ShouldAcceptAssistantReply_MatchesPlaybackAndAwaitingState(
+ bool awaitingReply,
+ bool isSpeaking,
+ int queuedReplyCount,
+ bool expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "ShouldAcceptAssistantReply",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+ var result = (bool)method.Invoke(null, [awaitingReply, isSpeaking, queuedReplyCount])!;
+
+ Assert.Equal(expected, result);
+ }
+
private static MethodInfo GetMethod()
{
return typeof(VoiceService).GetMethod(
From 82e295879529b44abffb84a1f93e437dbbe388a8 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 00:32:42 +0000
Subject: [PATCH 32/83] Add voice control and configuration APIs
---
docs/VOICE-MODE.md | 107 ++++++++++
.../Capabilities/VoiceCapability.cs | 72 +++++++
src/OpenClaw.Shared/VoiceModeSchema.cs | 26 ++-
src/OpenClaw.Tray.WinUI/App.xaml.cs | 2 +-
.../Controls/VoiceSettingsPanel.xaml.cs | 29 ++-
.../Services/NodeService.cs | 27 +++
.../Services/Voice/VoiceChatContracts.cs | 22 ++
.../Services/Voice/VoiceService.cs | 194 +++++++++++++++++-
.../Windows/SettingsWindow.xaml.cs | 12 +-
.../Windows/VoiceModeWindow.xaml.cs | 18 +-
.../VoiceModeSchemaTests.cs | 8 +-
11 files changed, 487 insertions(+), 30 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 4b65815..ce75158 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -12,6 +12,40 @@ This document defines the voice subsystem for the Windows node only. It introduc
- Implement `MiniMax` TTS and `ElevenLabs` TTS as required non-Windows providers after the Windows baseline
- Reuse the existing node capability pattern instead of introducing a parallel control path
+## Feature List
+
+### Story: Full-duplex / barge-in Talk Mode
+
+Allow the node to keep listening while it is speaking, so the user can interrupt or interleave speech without waiting for reply playback to finish.
+
+Notes:
+
+- the current Windows implementation is half-duplex: recognition is stopped or ignored while a reply is being spoken
+- a true implementation will require a lower-level audio pipeline rather than only `SpeechRecognizer` plus `SpeechSynthesizer`
+- practical requirements are likely to include:
+ - microphone capture that can remain active during playback
+ - acoustic echo cancellation / echo suppression
+ - barge-in detection and playback interruption rules
+ - a policy for whether interrupting speech cancels the current reply or queues behind it
+ - additional runtime control/status so the UI can show when barge-in is armed
+- this should be treated as a separate engineering phase, not a small extension of the current Talk Mode runtime
+
+### Story: Compact Voice Status Strip
+
+Add an optional tiny always-on-top voice strip window for Talk Mode.
+
+Notes:
+
+- user-configurable show / hide
+- intended to be a minimal one-line-high display with a small amount of padding
+- should show:
+ - current voice state
+ - rolling live transcript while listening
+ - rolling assistant text while speaking
+ - a skip / cut-off control while speaking
+- the runtime control surface should drive this window rather than the window manipulating `VoiceService` internals directly
+- if implemented later, the strip should use the shared runtime control API described below
+
## Non-Goals
- True full-duplex or chunk-streaming audio transport between node and gateway
@@ -63,6 +97,79 @@ That means the first Windows target is transcript transport, not raw audio uploa
The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection exists to carry assistant chat events for `TalkMode`, and to provide a fallback direct `chat.send` path when the tray chat window is not open.
+## Voice APIs
+
+The Windows tray implementation now has two API layers:
+
+- shared node-capability commands in `OpenClaw.Shared`
+- in-process tray interfaces used by the tray app's own windows
+
+### Shared Capability Commands
+
+The node capability command surface is:
+
+- `voice.devices.list`
+- `voice.settings.get`
+- `voice.settings.set`
+- `voice.status.get`
+- `voice.start`
+- `voice.stop`
+- `voice.pause`
+- `voice.resume`
+- `voice.skip`
+
+These commands are defined in [VoiceModeSchema.cs](../src/OpenClaw.Shared/VoiceModeSchema.cs) and handled by [VoiceCapability.cs](../src/OpenClaw.Shared/Capabilities/VoiceCapability.cs).
+
+`voice.settings.get` / `voice.settings.set` are the configuration API.
+
+`voice.start` / `voice.stop` / `voice.pause` / `voice.resume` / `voice.skip` are the runtime control API.
+
+### Status Surface
+
+`VoiceStatusInfo` now carries the basic state needed by control surfaces:
+
+- mode
+- runtime state
+- session key
+- input/output device ids
+- last wake / last utterance timestamps
+- pending reply count
+- whether a reply can currently be skipped
+- current reply preview
+- last error
+
+### In-Process Tray Interfaces
+
+The tray app also exposes in-process interfaces so its own windows do not need to bind directly to the concrete `VoiceService` implementation:
+
+- `IVoiceConfigurationApi`
+ - get voice settings
+ - update voice settings
+ - list devices
+ - get provider catalog
+ - get/set provider configuration
+- `IVoiceRuntimeControlApi`
+ - get runtime status
+ - start / stop
+ - pause / resume
+ - skip current reply
+- `IVoiceRuntime`
+ - transcript draft and conversation events for chat integration
+
+This is the intended base for future surfaces such as the compact voice strip.
+
+### Can the Settings Form Use This API?
+
+Yes. The Settings form can use the configuration API cleanly.
+
+The tray implementation now uses the voice configuration interface (`IVoiceConfigurationApi`) for:
+
+- provider catalog loading
+- device enumeration
+- applying updated voice settings / provider configuration on save
+
+That means the settings UI no longer relies solely on concrete `VoiceService` internals for its voice-specific behavior.
+
## Speech Output Latency
Microsoft's Azure Speech SDK latency guidance is specifically about speech synthesis, not speech recognition, so it applies to Windows voice output rather than voice input. Source: [Lower speech synthesis latency using Speech SDK](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-lower-speech-synthesis-latency?pivots=programming-language-csharp).
diff --git a/src/OpenClaw.Shared/Capabilities/VoiceCapability.cs b/src/OpenClaw.Shared/Capabilities/VoiceCapability.cs
index 37d98fa..fe1ddea 100644
--- a/src/OpenClaw.Shared/Capabilities/VoiceCapability.cs
+++ b/src/OpenClaw.Shared/Capabilities/VoiceCapability.cs
@@ -22,6 +22,9 @@ public class VoiceCapability : NodeCapabilityBase
public event Func<Task<VoiceStatusInfo>>? StatusRequested;
public event Func<VoiceStartArgs, Task<VoiceStatusInfo>>? StartRequested;
public event Func<VoiceStopArgs, Task<VoiceStatusInfo>>? StopRequested;
+ public event Func<VoicePauseArgs, Task<VoiceStatusInfo>>? PauseRequested;
+ public event Func<VoiceResumeArgs, Task<VoiceStatusInfo>>? ResumeRequested;
+ public event Func<VoiceSkipArgs, Task<VoiceStatusInfo>>? SkipRequested;
public VoiceCapability(IOpenClawLogger logger) : base(logger)
{
@@ -37,6 +40,9 @@ public override async Task<NodeInvokeResponse> ExecuteAsync(NodeInvokeRequest re
VoiceCommands.GetStatus => await HandleGetStatusAsync(),
VoiceCommands.Start => await HandleStartAsync(request),
VoiceCommands.Stop => await HandleStopAsync(request),
+ VoiceCommands.Pause => await HandlePauseAsync(request),
+ VoiceCommands.Resume => await HandleResumeAsync(request),
+ VoiceCommands.Skip => await HandleSkipAsync(request),
_ => Error($"Unknown command: {request.Command}")
};
}
@@ -171,4 +177,70 @@ private async Task<NodeInvokeResponse> HandleStopAsync(NodeInvokeRequest request
return Error($"Stop failed: {ex.Message}");
}
}
+
+ private async Task<NodeInvokeResponse> HandlePauseAsync(NodeInvokeRequest request)
+ {
+ Logger.Info(VoiceCommands.Pause);
+
+ if (PauseRequested == null)
+ return Error("Voice pause not available");
+
+ try
+ {
+ var rawArgs = request.Args.ValueKind is JsonValueKind.Undefined or JsonValueKind.Null
+ ? "{}"
+ : request.Args.GetRawText();
+ var args = JsonSerializer.Deserialize<VoicePauseArgs>(rawArgs, s_jsonOptions) ?? new VoicePauseArgs();
+ return Success(await PauseRequested(args));
+ }
+ catch (Exception ex)
+ {
+ Logger.Error("Voice pause failed", ex);
+ return Error($"Pause failed: {ex.Message}");
+ }
+ }
+
+ private async Task<NodeInvokeResponse> HandleResumeAsync(NodeInvokeRequest request)
+ {
+ Logger.Info(VoiceCommands.Resume);
+
+ if (ResumeRequested == null)
+ return Error("Voice resume not available");
+
+ try
+ {
+ var rawArgs = request.Args.ValueKind is JsonValueKind.Undefined or JsonValueKind.Null
+ ? "{}"
+ : request.Args.GetRawText();
+ var args = JsonSerializer.Deserialize<VoiceResumeArgs>(rawArgs, s_jsonOptions) ?? new VoiceResumeArgs();
+ return Success(await ResumeRequested(args));
+ }
+ catch (Exception ex)
+ {
+ Logger.Error("Voice resume failed", ex);
+ return Error($"Resume failed: {ex.Message}");
+ }
+ }
+
+ private async Task<NodeInvokeResponse> HandleSkipAsync(NodeInvokeRequest request)
+ {
+ Logger.Info(VoiceCommands.Skip);
+
+ if (SkipRequested == null)
+ return Error("Voice skip not available");
+
+ try
+ {
+ var rawArgs = request.Args.ValueKind is JsonValueKind.Undefined or JsonValueKind.Null
+ ? "{}"
+ : request.Args.GetRawText();
+ var args = JsonSerializer.Deserialize<VoiceSkipArgs>(rawArgs, s_jsonOptions) ?? new VoiceSkipArgs();
+ return Success(await SkipRequested(args));
+ }
+ catch (Exception ex)
+ {
+ Logger.Error("Voice skip failed", ex);
+ return Error($"Skip failed: {ex.Message}");
+ }
+ }
}
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index da4f759..8b2f542 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -13,6 +13,9 @@ public static class VoiceCommands
public const string GetStatus = "voice.status.get";
public const string Start = "voice.start";
public const string Stop = "voice.stop";
+ public const string Pause = "voice.pause";
+ public const string Resume = "voice.resume";
+ public const string Skip = "voice.skip";
private static readonly ReadOnlyCollection<string> s_all = Array.AsReadOnly(
[
@@ -21,7 +24,10 @@ public static class VoiceCommands
SetSettings,
GetStatus,
Start,
- Stop
+ Stop,
+ Pause,
+ Resume,
+ Skip
]);
public static IReadOnlyList<string> All => s_all;
@@ -115,6 +121,9 @@ public sealed class VoiceStatusInfo
public bool VoiceWakeLoaded { get; set; }
public DateTime? LastVoiceWakeUtc { get; set; }
public DateTime? LastUtteranceUtc { get; set; }
+ public int PendingReplyCount { get; set; }
+ public bool CanSkipReply { get; set; }
+ public string? CurrentReplyPreview { get; set; }
public string? LastError { get; set; }
}
@@ -129,6 +138,21 @@ public sealed class VoiceStopArgs
public string? Reason { get; set; }
}
+public sealed class VoicePauseArgs
+{
+ public string? Reason { get; set; }
+}
+
+public sealed class VoiceResumeArgs
+{
+ public string? Reason { get; set; }
+}
+
+public sealed class VoiceSkipArgs
+{
+ public string? Reason { get; set; }
+}
+
public sealed class VoiceSettingsUpdateArgs
{
public VoiceSettings Settings { get; set; } = new();
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index 8ecb2ff..eae2e21 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -1734,7 +1734,7 @@ private void ShowVoiceModeSettings()
if (_voiceModeWindow == null || _voiceModeWindow.IsClosed)
{
- _voiceModeWindow = new VoiceModeWindow(_settings, _voiceService);
+ _voiceModeWindow = new VoiceModeWindow(_settings, _voiceService, _voiceService);
_voiceModeWindow.OpenSettingsRequested += (s, e) => ShowSettings();
_voiceModeWindow.Closed += (s, e) => _voiceModeWindow = null;
}
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
index d16ea54..1d2903d 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -13,7 +13,7 @@ namespace OpenClawTray.Controls;
public sealed partial class VoiceSettingsPanel : UserControl
{
private SettingsManager? _settings;
- private VoiceService? _voiceService;
+ private IVoiceConfigurationApi? _voiceConfigurationApi;
private VoiceProviderConfigurationStore _voiceProviderConfigurationDraft = new();
private string _activeTtsProviderId = VoiceProviderIds.Windows;
private bool _updatingVoiceProviderFields;
@@ -28,20 +28,20 @@ public VoiceSettingsPanel()
InitializeComponent();
}
- public void Initialize(SettingsManager settings, VoiceService voiceService)
+ public void Initialize(SettingsManager settings, IVoiceConfigurationApi voiceConfigurationApi)
{
_settings = settings;
- _voiceService = voiceService;
+ _voiceConfigurationApi = voiceConfigurationApi;
LoadVoiceSettings();
_ = LoadVoiceDevicesAsync();
}
- public void ApplyTo(SettingsManager settings)
+ public async Task ApplyAsync(SettingsManager settings)
{
CaptureSelectedVoiceProviderSettings();
- settings.Voice = new VoiceSettings
+ var voiceSettings = new VoiceSettings
{
Mode = GetSelectedVoiceMode(),
Enabled = GetSelectedVoiceMode() != VoiceActivationMode.Off,
@@ -70,12 +70,23 @@ public void ApplyTo(SettingsManager settings)
ChatWindowSubmitMode = GetSelectedChatWindowSubmitMode()
}
};
+ settings.Voice = voiceSettings;
settings.VoiceProviderConfiguration = _voiceProviderConfigurationDraft.Clone();
+
+ if (_voiceConfigurationApi != null)
+ {
+ _voiceConfigurationApi.SetProviderConfiguration(_voiceProviderConfigurationDraft);
+ await _voiceConfigurationApi.UpdateSettingsAsync(new VoiceSettingsUpdateArgs
+ {
+ Settings = voiceSettings,
+ Persist = false
+ });
+ }
}
private void LoadVoiceSettings()
{
- if (_settings == null || _voiceService == null)
+ if (_settings == null || _voiceConfigurationApi == null)
{
return;
}
@@ -91,7 +102,7 @@ private void LoadVoiceSettings()
private void LoadVoiceProviders()
{
- var catalog = _voiceService!.GetProviderCatalog();
+ var catalog = _voiceConfigurationApi!.GetProviderCatalog();
_speechToTextOptions = catalog.SpeechToTextProviders
.Select(Clone)
@@ -113,7 +124,7 @@ private void LoadVoiceProviders()
private async Task LoadVoiceDevicesAsync()
{
- if (_settings == null || _voiceService == null)
+ if (_settings == null || _voiceConfigurationApi == null)
{
return;
}
@@ -121,7 +132,7 @@ private async Task LoadVoiceDevicesAsync()
try
{
VoiceSettingsInfoTextBlock.Text = "Loading voice devices...";
- var devices = await _voiceService.ListDevicesAsync();
+ var devices = await _voiceConfigurationApi.ListDevicesAsync();
_inputOptions =
[
diff --git a/src/OpenClaw.Tray.WinUI/Services/NodeService.cs b/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
index 6869eb7..1533caa 100644
--- a/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/NodeService.cs
@@ -199,6 +199,9 @@ private void RegisterCapabilities()
_voiceCapability.StatusRequested += OnVoiceGetStatus;
_voiceCapability.StartRequested += OnVoiceStart;
_voiceCapability.StopRequested += OnVoiceStop;
+ _voiceCapability.PauseRequested += OnVoicePause;
+ _voiceCapability.ResumeRequested += OnVoiceResume;
+ _voiceCapability.SkipRequested += OnVoiceSkip;
_nodeClient.RegisterCapability(_voiceCapability);
_logger.Info("All capabilities registered");
@@ -592,6 +595,30 @@ private Task<VoiceStatusInfo> OnVoiceStop(VoiceStopArgs args)
return _voiceService.StopAsync(args);
}
+ private Task<VoiceStatusInfo> OnVoicePause(VoicePauseArgs args)
+ {
+ if (_voiceService == null)
+ throw new InvalidOperationException("Voice service not available");
+
+ return _voiceService.PauseAsync(args);
+ }
+
+ private Task<VoiceStatusInfo> OnVoiceResume(VoiceResumeArgs args)
+ {
+ if (_voiceService == null)
+ throw new InvalidOperationException("Voice service not available");
+
+ return _voiceService.ResumeAsync(args);
+ }
+
+ private Task<VoiceStatusInfo> OnVoiceSkip(VoiceSkipArgs args)
+ {
+ if (_voiceService == null)
+ throw new InvalidOperationException("Voice service not available");
+
+ return _voiceService.SkipCurrentReplyAsync(args);
+ }
+
#endregion
public void Dispose()
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
index 17d7920..fb12212 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
@@ -32,6 +32,28 @@ public interface IVoiceRuntime
void NotifyManualTranscriptSubmitted(string text, string? sessionKey = null);
}
+public interface IVoiceConfigurationApi
+{
+ Task<VoiceSettings> GetSettingsAsync();
+ Task<VoiceSettings> UpdateSettingsAsync(VoiceSettingsUpdateArgs update);
+ Task<VoiceAudioDeviceInfo[]> ListDevicesAsync();
+ VoiceProviderCatalog GetProviderCatalog();
+ VoiceProviderConfigurationStore GetProviderConfiguration();
+ void SetProviderConfiguration(VoiceProviderConfigurationStore configurationStore);
+}
+
+public interface IVoiceRuntimeControlApi
+{
+ VoiceStatusInfo CurrentStatus { get; }
+ Task<VoiceStatusInfo> GetStatusAsync();
+ Task<VoiceStatusInfo> StartAsync(VoiceStartArgs args);
+ Task<VoiceStatusInfo> StopAsync(VoiceStopArgs args);
+ Task<VoiceStatusInfo> PauseAsync(VoicePauseArgs? args = null);
+ Task<VoiceStatusInfo> ResumeAsync(VoiceResumeArgs? args = null);
+ Task<VoiceStatusInfo> SkipCurrentReplyAsync(VoiceSkipArgs? args = null);
+ Task<VoiceStatusInfo> ToggleQuickPauseAsync();
+}
+
public interface IVoiceChatWindow
{
bool IsClosed { get; }
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 768a68c..8b28295 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -22,7 +22,7 @@
namespace OpenClawTray.Services.Voice;
-public sealed class VoiceService : IVoiceRuntime, IDisposable
+public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoiceRuntimeControlApi, IDisposable
{
private const string DefaultSessionKey = "main";
private const int HResultSpeechPrivacyDeclined = unchecked((int)0x80045509);
@@ -55,6 +55,8 @@ public sealed class VoiceService : IVoiceRuntime, IDisposable
private DateTime _lastTranscriptUtc;
private string? _pendingManualTranscript;
private readonly Queue<(string Text, string? SessionKey)> _pendingAssistantReplies = new();
+ private CancellationTokenSource? _playbackSkipCts;
+ private string? _currentReplyPreview;
private bool _disposed;
public event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
@@ -130,6 +132,24 @@ public Task<VoiceSettings> UpdateSettingsAsync(VoiceSettingsUpdateArgs update)
}
}
+ public VoiceProviderConfigurationStore GetProviderConfiguration()
+ {
+ lock (_gate)
+ {
+ return _settings.VoiceProviderConfiguration.Clone();
+ }
+ }
+
+ public void SetProviderConfiguration(VoiceProviderConfigurationStore configurationStore)
+ {
+ ArgumentNullException.ThrowIfNull(configurationStore);
+
+ lock (_gate)
+ {
+ _settings.VoiceProviderConfiguration = configurationStore.Clone();
+ }
+ }
+
public Task<VoiceStatusInfo> GetStatusAsync()
{
lock (_gate)
@@ -372,6 +392,95 @@ public void Dispose()
}
}
+ public async Task<VoiceStatusInfo> PauseAsync(VoicePauseArgs? args = null)
+ {
+ ObjectDisposedException.ThrowIf(_disposed, this);
+ args ??= new VoicePauseArgs();
+
+ VoiceActivationMode mode;
+ string? sessionKey;
+
+ lock (_gate)
+ {
+ mode = _runtimeModeOverride ?? _settings.Voice.Mode;
+ sessionKey = _status.SessionKey;
+
+ if (!_settings.Voice.Enabled || mode == VoiceActivationMode.Off)
+ {
+ _quickPaused = false;
+ _status = BuildStoppedStatus(sessionKey, "Voice mode is disabled");
+ return Clone(_status);
+ }
+
+ if (_quickPaused || _status.State == VoiceRuntimeState.Paused)
+ {
+ return Clone(_status);
+ }
+
+ _quickPaused = true;
+ }
+
+ await StopRuntimeResourcesAsync(updateStoppedStatus: false);
+
+ lock (_gate)
+ {
+ _status = BuildPausedStatus(mode, sessionKey, args.Reason);
+ _logger.Info($"Voice runtime paused{(string.IsNullOrWhiteSpace(args.Reason) ? string.Empty : $": {args.Reason}")}");
+ return Clone(_status);
+ }
+ }
+
+ public async Task<VoiceStatusInfo> ResumeAsync(VoiceResumeArgs? args = null)
+ {
+ ObjectDisposedException.ThrowIf(_disposed, this);
+ args ??= new VoiceResumeArgs();
+
+ VoiceActivationMode mode;
+ string? sessionKey;
+
+ lock (_gate)
+ {
+ mode = _runtimeModeOverride ?? _settings.Voice.Mode;
+ sessionKey = _status.SessionKey;
+ _quickPaused = false;
+ }
+
+ var resumed = await StartAsync(new VoiceStartArgs
+ {
+ Mode = mode,
+ SessionKey = sessionKey
+ });
+
+ _logger.Info($"Voice runtime resumed{(string.IsNullOrWhiteSpace(args.Reason) ? string.Empty : $": {args.Reason}")}");
+ return resumed;
+ }
+
+ public async Task<VoiceStatusInfo> SkipCurrentReplyAsync(VoiceSkipArgs? args = null)
+ {
+ args ??= new VoiceSkipArgs();
+
+ CancellationTokenSource? playbackSkipCts;
+
+ lock (_gate)
+ {
+ playbackSkipCts = _playbackSkipCts;
+ if (playbackSkipCts == null && _pendingAssistantReplies.Count == 0)
+ {
+ return Clone(_status);
+ }
+ }
+
+ try { playbackSkipCts?.Cancel(); } catch (ObjectDisposedException) { }
+
+ await Task.Yield();
+
+ lock (_gate)
+ {
+ _logger.Info($"Voice reply skipped{(string.IsNullOrWhiteSpace(args.Reason) ? string.Empty : $": {args.Reason}")}");
+ return Clone(_status);
+ }
+ }
+
private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? sessionKey)
{
var effectiveSessionKey = string.IsNullOrWhiteSpace(sessionKey) ? DefaultSessionKey : sessionKey;
@@ -1061,6 +1170,14 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
_status.LastError);
shouldStartPlaybackLoop = true;
}
+ else
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.PlayingResponse,
+ _status.LastError);
+ }
}
if (shouldStartPlaybackLoop)
@@ -1082,6 +1199,7 @@ private async Task ProcessQueuedAssistantRepliesAsync()
{
(string Text, string? SessionKey) reply;
var shouldPauseBeforeNextReply = false;
+ CancellationTokenSource? playbackSkipCts = null;
lock (_gate)
{
@@ -1089,6 +1207,7 @@ private async Task ProcessQueuedAssistantRepliesAsync()
{
_replyPlaybackLoopActive = false;
_isSpeaking = false;
+ _currentReplyPreview = null;
if (_status.Running)
{
@@ -1104,11 +1223,23 @@ private async Task ProcessQueuedAssistantRepliesAsync()
reply = _pendingAssistantReplies.Dequeue();
shouldPauseBeforeNextReply = _pendingAssistantReplies.Count > 0;
+ _currentReplyPreview = CreateReplyPreview(reply.Text);
+ _isSpeaking = true;
+ _playbackSkipCts = playbackSkipCts = new CancellationTokenSource();
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.PlayingResponse,
+ _status.LastError);
}
try
{
- await SpeakTextAsync(reply.Text);
+ await SpeakTextAsync(reply.Text, playbackSkipCts.Token);
+ }
+ catch (OperationCanceledException)
+ {
+ _logger.Info($"Voice reply playback canceled: remainingQueue={CurrentStatus.PendingReplyCount}");
}
catch (Exception ex)
{
@@ -1122,6 +1253,20 @@ private async Task ProcessQueuedAssistantRepliesAsync()
GetUserFacingErrorMessage(ex));
}
}
+ finally
+ {
+ lock (_gate)
+ {
+ if (ReferenceEquals(_playbackSkipCts, playbackSkipCts))
+ {
+ _playbackSkipCts = null;
+ }
+
+ _currentReplyPreview = null;
+ }
+
+ playbackSkipCts?.Dispose();
+ }
if (shouldPauseBeforeNextReply)
{
@@ -1136,6 +1281,7 @@ private async Task ProcessQueuedAssistantRepliesAsync()
{
_replyPlaybackLoopActive = false;
_isSpeaking = false;
+ _currentReplyPreview = null;
if (_status.Running)
{
_status = BuildRunningStatus(
@@ -1157,7 +1303,7 @@ private async Task ProcessQueuedAssistantRepliesAsync()
}
}
- private async Task SpeakTextAsync(string text)
+ private async Task SpeakTextAsync(string text, CancellationToken cancellationToken)
{
VoiceSettings settings;
VoiceProviderConfigurationStore providerConfiguration;
@@ -1184,7 +1330,7 @@ private async Task SpeakTextAsync(string text)
if (UsesCloudTextToSpeechRuntime(provider))
{
using var result = await _cloudTextToSpeechClient.SynthesizeAsync(text, provider, providerConfiguration, _logger);
- await PlayStreamAsync(player, result.Stream, result.ContentType);
+ await PlayStreamAsync(player, result.Stream, result.ContentType, cancellationToken);
return;
}
@@ -1196,7 +1342,7 @@ private async Task SpeakTextAsync(string text)
var stopwatch = Stopwatch.StartNew();
using var stream = await synthesizer.SynthesizeTextToStreamAsync(text);
_logger.Info($"Windows TTS latency: total={stopwatch.ElapsedMilliseconds}ms");
- await PlayStreamAsync(player, stream, stream.ContentType);
+ await PlayStreamAsync(player, stream, stream.ContentType, cancellationToken);
}
private static bool UsesCloudTextToSpeechRuntime(VoiceProviderOption provider)
@@ -1209,10 +1355,22 @@ internal static bool ShouldAcceptAssistantReply(bool awaitingReply, bool isSpeak
return awaitingReply || isSpeaking || queuedReplyCount > 0;
}
+ private static string CreateReplyPreview(string text)
+ {
+ var trimmed = text.Trim();
+ if (trimmed.Length <= 120)
+ {
+ return trimmed;
+ }
+
+ return $"{trimmed[..117]}...";
+ }
+
private static async Task PlayStreamAsync(
MediaPlayer player,
IRandomAccessStream stream,
- string contentType)
+ string contentType,
+ CancellationToken cancellationToken)
{
stream.Seek(0);
var playbackEnded = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
@@ -1225,6 +1383,12 @@ private static async Task PlayStreamAsync(
player.MediaEnded += endedHandler;
player.MediaFailed += failedHandler;
+ using var registration = cancellationToken.Register(() =>
+ {
+ try { player.Pause(); } catch { }
+ try { player.Source = null; } catch { }
+ playbackEnded.TrySetCanceled(cancellationToken);
+ });
try
{
@@ -1334,6 +1498,7 @@ private void OnChatTransportStatusChanged(object? sender, ConnectionStatus statu
private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
{
CancellationTokenSource? runtimeCts;
+ CancellationTokenSource? playbackSkipCts;
OpenClawGatewayClient? chatClient;
SpeechRecognizer? recognizer;
SpeechSynthesizer? synthesizer;
@@ -1365,9 +1530,13 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_replyPlaybackLoopActive = false;
_pendingManualTranscript = null;
_pendingAssistantReplies.Clear();
+ _currentReplyPreview = null;
+ playbackSkipCts = _playbackSkipCts;
+ _playbackSkipCts = null;
}
try { runtimeCts?.Cancel(); } catch { }
+ try { playbackSkipCts?.Cancel(); } catch { }
if (recognizer != null)
{
@@ -1397,6 +1566,7 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
}
try { runtimeCts?.Dispose(); } catch { }
+ try { playbackSkipCts?.Dispose(); } catch { }
if (updateStoppedStatus)
{
@@ -1491,6 +1661,9 @@ private VoiceStatusInfo BuildRunningStatus(
VoiceWakeLoaded = mode == VoiceActivationMode.VoiceWake,
LastVoiceWakeUtc = _status.LastVoiceWakeUtc,
LastUtteranceUtc = _status.LastUtteranceUtc,
+ PendingReplyCount = _pendingAssistantReplies.Count,
+ CanSkipReply = _isSpeaking || _pendingAssistantReplies.Count > 0,
+ CurrentReplyPreview = _currentReplyPreview,
LastError = lastError
};
}
@@ -1511,6 +1684,9 @@ private VoiceStatusInfo BuildStoppedStatus(string? sessionKey, string? reason)
VoiceWakeLoaded = false,
LastVoiceWakeUtc = _status.LastVoiceWakeUtc,
LastUtteranceUtc = _status.LastUtteranceUtc,
+ PendingReplyCount = _pendingAssistantReplies.Count,
+ CanSkipReply = _isSpeaking || _pendingAssistantReplies.Count > 0,
+ CurrentReplyPreview = _currentReplyPreview,
LastError = reason
};
}
@@ -1531,6 +1707,9 @@ private VoiceStatusInfo BuildPausedStatus(VoiceActivationMode mode, string? sess
VoiceWakeLoaded = false,
LastVoiceWakeUtc = _status.LastVoiceWakeUtc,
LastUtteranceUtc = _status.LastUtteranceUtc,
+ PendingReplyCount = _pendingAssistantReplies.Count,
+ CanSkipReply = _isSpeaking || _pendingAssistantReplies.Count > 0,
+ CurrentReplyPreview = _currentReplyPreview,
LastError = reason
};
}
@@ -1590,6 +1769,9 @@ private static VoiceStatusInfo Clone(VoiceStatusInfo source)
VoiceWakeLoaded = source.VoiceWakeLoaded,
LastVoiceWakeUtc = source.LastVoiceWakeUtc,
LastUtteranceUtc = source.LastUtteranceUtc,
+ PendingReplyCount = source.PendingReplyCount,
+ CanSkipReply = source.CanSkipReply,
+ CurrentReplyPreview = source.CurrentReplyPreview,
LastError = source.LastError
};
}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
index 9e442e9..eb686a8 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
@@ -19,7 +19,7 @@ public sealed partial class SettingsWindow : WindowEx
public event EventHandler? SettingsSaved;
- public SettingsWindow(SettingsManager settings, VoiceService voiceService)
+ public SettingsWindow(SettingsManager settings, IVoiceConfigurationApi voiceConfigurationApi)
{
_settings = settings;
InitializeComponent();
@@ -31,7 +31,7 @@ public SettingsWindow(SettingsManager settings, VoiceService voiceService)
this.SetIcon(IconHelper.GetStatusIconPath(ConnectionStatus.Connected));
LoadSettings();
- VoiceSettingsPanel.Initialize(_settings, voiceService);
+ VoiceSettingsPanel.Initialize(_settings, voiceConfigurationApi);
Closed += (s, e) => IsClosed = true;
@@ -72,7 +72,7 @@ private void LoadSettings()
NodeModeToggle.IsOn = _settings.EnableNodeMode;
}
- private void SaveSettings()
+ private async Task SaveSettingsAsync()
{
_settings.GatewayUrl = GatewayUrlTextBox.Text.Trim();
_settings.Token = TokenTextBox.Text.Trim();
@@ -95,7 +95,7 @@ private void SaveSettings()
_settings.NotifyInfo = NotifyInfoCb.IsChecked ?? true;
_settings.EnableNodeMode = NodeModeToggle.IsOn;
- VoiceSettingsPanel.ApplyTo(_settings);
+ await VoiceSettingsPanel.ApplyAsync(_settings);
_settings.Save();
AutoStartManager.SetAutoStart(_settings.AutoStart);
@@ -187,7 +187,7 @@ private void OnTestNotification(object sender, RoutedEventArgs e)
}
}
- private void OnSave(object sender, RoutedEventArgs e)
+ private async void OnSave(object sender, RoutedEventArgs e)
{
var gatewayUrl = GatewayUrlTextBox.Text.Trim();
if (!GatewayUrlHelper.IsValidGatewayUrl(gatewayUrl))
@@ -200,7 +200,7 @@ private void OnSave(object sender, RoutedEventArgs e)
var oldGateway = _settings.GatewayUrl;
var oldAutoStart = _settings.AutoStart;
var oldNodeMode = _settings.EnableNodeMode;
- SaveSettings();
+ await SaveSettingsAsync();
if (!string.Equals(oldGateway, _settings.GatewayUrl, StringComparison.Ordinal))
{
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
index 1029c90..cac4543 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
@@ -13,16 +13,21 @@ namespace OpenClawTray.Windows;
public sealed partial class VoiceModeWindow : WindowEx
{
private readonly SettingsManager _settings;
- private readonly VoiceService _voiceService;
+ private readonly IVoiceRuntimeControlApi _voiceRuntimeControlApi;
+ private readonly IVoiceConfigurationApi _voiceConfigurationApi;
public bool IsClosed { get; private set; }
public event EventHandler? OpenSettingsRequested;
- public VoiceModeWindow(SettingsManager settings, VoiceService voiceService)
+ public VoiceModeWindow(
+ SettingsManager settings,
+ IVoiceRuntimeControlApi voiceRuntimeControlApi,
+ IVoiceConfigurationApi voiceConfigurationApi)
{
_settings = settings;
- _voiceService = voiceService;
+ _voiceRuntimeControlApi = voiceRuntimeControlApi;
+ _voiceConfigurationApi = voiceConfigurationApi;
InitializeComponent();
@@ -38,8 +43,8 @@ public VoiceModeWindow(SettingsManager settings, VoiceService voiceService)
public void RefreshStatus()
{
- var running = _voiceService.CurrentStatus;
- var catalog = _voiceService.GetProviderCatalog();
+ var running = _voiceRuntimeControlApi.CurrentStatus;
+ var catalog = _voiceConfigurationApi.GetProviderCatalog();
StatusItemsControl.ItemsSource = new List<DetailRow>
{
@@ -47,7 +52,8 @@ public void RefreshStatus()
new("Runtime", VoiceDisplayHelper.GetRuntimeLabel(running)),
new("Node Mode", _settings.EnableNodeMode ? "Enabled" : "Disabled"),
new("Session", string.IsNullOrWhiteSpace(running.SessionKey) ? "main" : running.SessionKey!),
- new("State", VoiceDisplayHelper.GetStateLabel(running.State))
+ new("State", VoiceDisplayHelper.GetStateLabel(running.State)),
+ new("Queued replies", running.PendingReplyCount.ToString())
};
ConfigurationItemsControl.ItemsSource = new List<DetailRow>
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index a140140..d0edc0d 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -15,7 +15,10 @@ public void All_ContainsExpectedCommandsInStableOrder()
"voice.settings.set",
"voice.status.get",
"voice.start",
- "voice.stop"
+ "voice.stop",
+ "voice.pause",
+ "voice.resume",
+ "voice.skip"
],
VoiceCommands.All);
}
@@ -53,6 +56,9 @@ public void VoiceStatusInfo_Defaults_ToStopped()
Assert.Equal(VoiceActivationMode.Off, status.Mode);
Assert.Equal(VoiceRuntimeState.Stopped, status.State);
Assert.False(status.VoiceWakeLoaded);
+ Assert.Equal(0, status.PendingReplyCount);
+ Assert.False(status.CanSkipReply);
+ Assert.Null(status.CurrentReplyPreview);
Assert.Null(status.LastError);
}
From 06d508fd4d4e83b949bb294ce6c7efcde4c3ed39 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 00:57:10 +0000
Subject: [PATCH 33/83] Accept late talk mode replies after timeout
---
.../Services/Voice/VoiceService.cs | 64 ++++++++++++++++++-
.../VoiceServiceTransportTests.cs | 48 ++++++++++++--
2 files changed, 104 insertions(+), 8 deletions(-)
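
Reviewer note: the late-reply rule added in this patch can be read in isolation as the sketch below. It restates the predicate from `ShouldAcceptLateAssistantReply` as a standalone class so the conditions are easier to follow one at a time. The class name `LateReplyGate` and the method name `ShouldAcceptLate` are hypothetical and exist only for illustration; the parameter order and comparisons follow the patch.

```csharp
#nullable enable
using System;

// Hypothetical standalone restatement of the late-reply predicate in this patch.
// LateReplyGate and ShouldAcceptLate are illustrative names, not part of the codebase.
public static class LateReplyGate
{
    public static bool ShouldAcceptLate(
        bool awaitingReply,
        bool isSpeaking,
        int queuedReplyCount,
        string? lateReplySessionKey,
        DateTime? graceUntilUtc,
        string? incomingSessionKey,
        DateTime utcNow)
    {
        // Late acceptance only applies when none of the normal acceptance
        // conditions (awaiting, speaking, queued replies) already hold.
        if (awaitingReply || isSpeaking || queuedReplyCount > 0)
            return false;

        // Both session keys must be present and match, ignoring case.
        if (string.IsNullOrWhiteSpace(lateReplySessionKey) ||
            string.IsNullOrWhiteSpace(incomingSessionKey) ||
            !string.Equals(lateReplySessionKey, incomingSessionKey, StringComparison.OrdinalIgnoreCase))
            return false;

        // The reply must arrive at or before the end of the grace window.
        return graceUntilUtc.HasValue && utcNow <= graceUntilUtc.Value;
    }
}
```

Because the comparison is `<=`, a reply that arrives exactly at the end of the two-minute window is still accepted; the test data in this patch probes 30 s (accepted) and 121 s (rejected) but not the boundary itself.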
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 8b28295..7adf99e 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -28,6 +28,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private const int HResultSpeechPrivacyDeclined = unchecked((int)0x80045509);
private static readonly TimeSpan TransportConnectTimeout = TimeSpan.FromSeconds(10);
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
+ private static readonly TimeSpan LateReplyGraceWindow = TimeSpan.FromMinutes(2);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
private static readonly TimeSpan RecognitionResumeRetryDelay = TimeSpan.FromMilliseconds(500);
private static readonly TimeSpan QueuedReplyPlaybackGap = TimeSpan.FromMilliseconds(500);
@@ -57,6 +58,8 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private readonly Queue<(string Text, string? SessionKey)> _pendingAssistantReplies = new();
private CancellationTokenSource? _playbackSkipCts;
private string? _currentReplyPreview;
+ private string? _lateReplySessionKey;
+ private DateTime? _lateReplyGraceUntilUtc;
private bool _disposed;
public event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
@@ -999,6 +1002,8 @@ private async Task HandleRecognizedTextAsync(string text)
lock (_gate)
{
_awaitingReply = true;
+ _lateReplySessionKey = null;
+ _lateReplyGraceUntilUtc = null;
_pendingManualTranscript = null;
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
@@ -1048,6 +1053,8 @@ public void NotifyManualTranscriptSubmitted(string text, string? sessionKey = nu
effectiveSessionKey = string.IsNullOrWhiteSpace(sessionKey) ? GetCurrentVoiceSessionKey() : sessionKey!;
_pendingManualTranscript = null;
_awaitingReply = true;
+ _lateReplySessionKey = null;
+ _lateReplyGraceUntilUtc = null;
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
@@ -1069,12 +1076,16 @@ private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken
await Task.Delay(ReplyTimeout, cancellationToken);
var shouldResume = false;
+ string? lateReplySessionKey = null;
lock (_gate)
{
if (_awaitingReply &&
string.Equals(_lastTranscript, transcript, StringComparison.OrdinalIgnoreCase))
{
_awaitingReply = false;
+ lateReplySessionKey = GetCurrentVoiceSessionKey();
+ _lateReplySessionKey = lateReplySessionKey;
+ _lateReplyGraceUntilUtc = DateTime.UtcNow.Add(LateReplyGraceWindow);
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
@@ -1086,6 +1097,8 @@ private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken
if (shouldResume)
{
+ _logger.Warn(
+ $"Voice reply wait timed out after {ReplyTimeout.TotalSeconds:0}s; accepting late replies for {LateReplyGraceWindow.TotalSeconds:0}s on session {lateReplySessionKey ?? "(none)"}");
await ResumeRecognitionSessionAsync(cancellationToken, "reply timeout");
}
}
@@ -1107,6 +1120,7 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
string text;
bool shouldStartPlaybackLoop = false;
+ bool acceptedViaLateReplyGrace = false;
lock (_gate)
{
@@ -1120,15 +1134,34 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
return;
}
- if (!ShouldAcceptAssistantReply(_awaitingReply, _isSpeaking, _pendingAssistantReplies.Count))
+ acceptedViaLateReplyGrace = ShouldAcceptLateAssistantReply(
+ _awaitingReply,
+ _isSpeaking,
+ _pendingAssistantReplies.Count,
+ _lateReplySessionKey,
+ _lateReplyGraceUntilUtc,
+ args.SessionKey,
+ DateTime.UtcNow);
+
+ if (!ShouldAcceptAssistantReply(_awaitingReply, _isSpeaking, _pendingAssistantReplies.Count, acceptedViaLateReplyGrace))
{
return;
}
_awaitingReply = false;
+ if (acceptedViaLateReplyGrace)
+ {
+ _lateReplySessionKey = null;
+ _lateReplyGraceUntilUtc = null;
+ }
text = PrepareReplyForSpeech(args.Message);
}
+ if (acceptedViaLateReplyGrace)
+ {
+ _logger.Warn($"Voice accepted late assistant reply after timeout for session {args.SessionKey}");
+ }
+
if (string.IsNullOrWhiteSpace(text))
{
var shouldResumeRecognition = false;
@@ -1350,9 +1383,32 @@ private static bool UsesCloudTextToSpeechRuntime(VoiceProviderOption provider)
return provider.TextToSpeechHttp != null || provider.TextToSpeechWebSocket != null;
}
- internal static bool ShouldAcceptAssistantReply(bool awaitingReply, bool isSpeaking, int queuedReplyCount)
+ internal static bool ShouldAcceptAssistantReply(
+ bool awaitingReply,
+ bool isSpeaking,
+ int queuedReplyCount,
+ bool acceptedViaLateReplyGrace = false)
+ {
+ return awaitingReply || isSpeaking || queuedReplyCount > 0 || acceptedViaLateReplyGrace;
+ }
+
+ internal static bool ShouldAcceptLateAssistantReply(
+ bool awaitingReply,
+ bool isSpeaking,
+ int queuedReplyCount,
+ string? lateReplySessionKey,
+ DateTime? lateReplyGraceUntilUtc,
+ string? incomingSessionKey,
+ DateTime utcNow)
{
- return awaitingReply || isSpeaking || queuedReplyCount > 0;
+ return !awaitingReply &&
+ !isSpeaking &&
+ queuedReplyCount == 0 &&
+ !string.IsNullOrWhiteSpace(lateReplySessionKey) &&
+ !string.IsNullOrWhiteSpace(incomingSessionKey) &&
+ string.Equals(lateReplySessionKey, incomingSessionKey, StringComparison.OrdinalIgnoreCase) &&
+ lateReplyGraceUntilUtc.HasValue &&
+ utcNow <= lateReplyGraceUntilUtc.Value;
}
private static string CreateReplyPreview(string text)
@@ -1531,6 +1587,8 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_pendingManualTranscript = null;
_pendingAssistantReplies.Clear();
_currentReplyPreview = null;
+ _lateReplySessionKey = null;
+ _lateReplyGraceUntilUtc = null;
playbackSkipCts = _playbackSkipCts;
_playbackSkipCts = null;
}
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index fdd5c75..ce41a37 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -67,20 +67,58 @@ public void UsesCloudTextToSpeechRuntime_ReturnsTrueForWebSocketProviders()
}
[Theory]
- [InlineData(true, false, 0, true)]
- [InlineData(false, true, 0, true)]
- [InlineData(false, false, 1, true)]
- [InlineData(false, false, 0, false)]
+ [InlineData(true, false, 0, false, true)]
+ [InlineData(false, true, 0, false, true)]
+ [InlineData(false, false, 1, false, true)]
+ [InlineData(false, false, 0, true, true)]
+ [InlineData(false, false, 0, false, false)]
public void ShouldAcceptAssistantReply_MatchesPlaybackAndAwaitingState(
bool awaitingReply,
bool isSpeaking,
int queuedReplyCount,
+ bool acceptedViaLateReplyGrace,
bool expected)
{
var method = typeof(VoiceService).GetMethod(
"ShouldAcceptAssistantReply",
BindingFlags.NonPublic | BindingFlags.Static)!;
- var result = (bool)method.Invoke(null, [awaitingReply, isSpeaking, queuedReplyCount])!;
+ var result = (bool)method.Invoke(null, [awaitingReply, isSpeaking, queuedReplyCount, acceptedViaLateReplyGrace])!;
+
+ Assert.Equal(expected, result);
+ }
+
+ [Theory]
+ [InlineData(false, false, 0, "main", "main", 30, true)]
+ [InlineData(false, false, 0, "main", "main", 121, false)]
+ [InlineData(true, false, 0, "main", "main", 30, false)]
+ [InlineData(false, true, 0, "main", "main", 30, false)]
+ [InlineData(false, false, 1, "main", "main", 30, false)]
+ [InlineData(false, false, 0, "main", "other", 30, false)]
+ public void ShouldAcceptLateAssistantReply_OnlyMatchesBoundedGraceWindow(
+ bool awaitingReply,
+ bool isSpeaking,
+ int queuedReplyCount,
+ string lateReplySessionKey,
+ string incomingSessionKey,
+ int secondsAfterTimeout,
+ bool expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "ShouldAcceptLateAssistantReply",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+ var timeoutUtc = new DateTime(2026, 3, 25, 0, 0, 0, DateTimeKind.Utc);
+ var graceUntilUtc = timeoutUtc.AddMinutes(2);
+ var result = (bool)method.Invoke(
+ null,
+ [
+ awaitingReply,
+ isSpeaking,
+ queuedReplyCount,
+ lateReplySessionKey,
+ graceUntilUtc,
+ incomingSessionKey,
+ timeoutUtc.AddSeconds(secondsAfterTimeout)
+ ])!;
Assert.Equal(expected, result);
}
From d8cd664e42a7c0ed61c6b95bb374179834a3b9bd Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 00:58:10 +0000
Subject: [PATCH 34/83] Add voice mode commit timeline to docs
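The VOICE-MODE.md bullet added here describes the late-reply grace window, whose predicate is exercised by `ShouldAcceptLateAssistantReply_OnlyMatchesBoundedGraceWindow`. The following is a minimal sketch of a predicate that would satisfy those test rows. It is not the shipped method body: the class name, the nullable parameter types, and the ordinal session-key comparison are assumptions, and only the parameter order is taken from the test's `Invoke` call.

```csharp
#nullable enable
using System;

// Hypothetical reconstruction for illustration; not the VoiceService implementation.
public static class LateReplySketch
{
    // Accept a reply that arrives after the reply timeout only when nothing
    // else is in flight, the reply belongs to the session that timed out,
    // and the grace deadline has not yet passed.
    public static bool ShouldAcceptLateAssistantReply(
        bool awaitingReply,
        bool isSpeaking,
        int queuedReplyCount,
        string? lateReplySessionKey,
        DateTime? lateReplyGraceUntilUtc,
        string? incomingSessionKey,
        DateTime utcNow)
    {
        return !awaitingReply &&
               !isSpeaking &&
               queuedReplyCount == 0 &&
               lateReplyGraceUntilUtc.HasValue &&
               !string.IsNullOrEmpty(lateReplySessionKey) &&
               // Assumption: session keys are compared ordinally.
               string.Equals(lateReplySessionKey, incomingSessionKey, StringComparison.Ordinal) &&
               utcNow <= lateReplyGraceUntilUtc.Value;
    }
}
```

With a deadline two minutes after the timeout, a matching-session reply 30 seconds late is accepted and one 121 seconds late is rejected, which mirrors the first two `InlineData` rows.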
---
docs/VOICE-MODE.md | 119 ++++++++++++++++++++++++++++++++-------------
1 file changed, 85 insertions(+), 34 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index ce75158..c3e71a5 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -12,40 +12,6 @@ This document defines the voice subsystem for the Windows node only. It introduc
- Implement `MiniMax` TTS and `ElevenLabs` TTS as required non-Windows providers after the Windows baseline
- Reuse the existing node capability pattern instead of introducing a parallel control path
-## Feature List
-
-### Story: Full-duplex / barge-in Talk Mode
-
-Allow the node to keep listening while it is speaking, so the user can interrupt or interleave speech without waiting for reply playback to finish.
-
-Notes:
-
-- the current Windows implementation is half-duplex: recognition is stopped or ignored while a reply is being spoken
-- a true implementation will require a lower-level audio pipeline rather than only `SpeechRecognizer` plus `SpeechSynthesizer`
-- practical requirements are likely to include:
- - microphone capture that can remain active during playback
- - acoustic echo cancellation / echo suppression
- - barge-in detection and playback interruption rules
- - a policy for whether interrupt speech cancels the current reply or queues behind it
- - additional runtime control/status so the UI can show when barge-in is armed
-- this should be treated as a separate engineering phase, not a small extension of the current Talk Mode runtime
-
-### Story: Compact Voice Status Strip
-
-Add an optional tiny always-on-top voice strip window for Talk Mode.
-
-Notes:
-
-- user-configurable show / hide
-- intended to be a minimal one-line-high display with a small amount of padding
-- should show:
- - current voice state
- - rolling live transcript while listening
- - rolling assistant text while speaking
- - a skip / cut-off control while speaking
-- the runtime control surface should drive this window rather than the window manipulating `VoiceService` internals directly
-- if implemented later, the strip should use the shared runtime control API described below
-
## Non-Goals
- True full-duplex or chunk-streaming audio transport between node and gateway
@@ -90,6 +56,7 @@ The contracts and persisted settings now use `VoiceWake` and `TalkMode` as well.
- OpenClaw returns the assistant reply as normal chat output
- the node performs local TTS playback of that reply
- assistant replies are queued locally and spoken sequentially, with a short 500 ms pause between queued replies so overlapping responses are not lost
+- if a reply arrives after the normal 45-second wait timeout, the tray still accepts and speaks that late reply for a bounded 2-minute grace window so slow upstream responses are not silently lost
To avoid obvious duplicate sends from the Windows recognizer, exact duplicate final transcripts are suppressed within a short 750 ms window.
@@ -706,3 +673,87 @@ The Windows node still keeps provider choice bounded:
- OpenClaw still owns the conversation/session flow
This keeps the provider surface narrow while still meeting the required MiniMax/ElevenLabs support direction.
+
+## Feature List (Backlog)
+
+### Story: Support non-local (or non-Windows, local) STT providers
+
+Allow the user to select a non-local STT provider such as OpenAI Whisper, or a local non-Windows STT provider.
+
+- Windows built-in local STT works well, but users should be able to choose:
+ - a non-local STT provider
+ - a local non-Windows STT provider
+
+We're all about the choices, Baby!
+
+
+### Story: Full-duplex / barge-in Talk Mode
+
+Allow the node to keep listening while it is speaking, so the user can interrupt or interleave speech without waiting for reply playback to finish.
+
+Notes:
+
+- the current Windows implementation is half-duplex: recognition is stopped or ignored while a reply is being spoken
+- a true implementation will require a lower-level audio pipeline rather than only `SpeechRecognizer` plus `SpeechSynthesizer`
+- practical requirements are likely to include:
+ - microphone capture that can remain active during playback
+ - acoustic echo cancellation / echo suppression
+ - barge-in detection and playback interruption rules
+ - a policy for whether interrupt speech cancels the current reply or queues behind it
+ - additional runtime control/status so the UI can show when barge-in is armed
+- this should be treated as a separate engineering phase, not a small extension of the current Talk Mode runtime
+
+
+### Story: Compact Voice Status Strip
+
+Add an optional tiny always-on-top voice strip window for Talk Mode.
+
+Notes:
+
+- user-configurable show / hide
+- intended to be a minimal one-line-high display with a small amount of padding
+- should show:
+ - current voice state
+ - rolling live transcript while listening
+ - rolling assistant text while speaking
+ - a skip / cut-off control while speaking
+- the runtime control surface should drive this window rather than the window manipulating `VoiceService` internals directly
+- if implemented later, the strip should use the shared runtime control API described elsewhere in this document.
+
+## Commit Timeline
+
+Append one new line to this timeline for every future voice-mode commit.
+
+- `2026-03-21` `be624fe` Added the initial Windows voice-mode foundation and the first AlwaysOn runtime.
+- `2026-03-23` `f40ffc3` Fixed voice chat transport and reply routing.
+- `2026-03-23` `a81d31e` Added configurable voice settings and the setup UI.
+- `2026-03-23` `197a89b` Integrated always-on voice mode with the tray chat workflow.
+- `2026-03-23` `1340bde` Fixed tray voice startup and chat window submission.
+- `2026-03-23` `aed8cb8` Removed the stale always-on autosubmit setting.
+- `2026-03-23` `25dd06b` Added focused coordinator coverage for tray voice chat.
+- `2026-03-23` `1336472` Addressed review findings and hardened the voice runtime.
+- `2026-03-23` `0f1028a` Documented MiniMax and ElevenLabs as required provider support.
+- `2026-03-23` `2c8a46d` Hardened tray chat voice message handling.
+- `2026-03-23` `fdbf48e` Fixed voice transport connection task reuse.
+- `2026-03-23` `b556c64` Grouped voice runtime services under `Services/Voice`.
+- `2026-03-23` `7f31c12` Implemented MiniMax TTS for voice mode.
+- `2026-03-23` `c64f168` Added editable TTS provider settings to voice mode.
+- `2026-03-23` `907a1a0` Moved voice settings into the main settings window.
+- `2026-03-23` `6dba89b` Extracted the hosted voice settings panel from the settings window.
+- `2026-03-23` `ded41a2` Generalized cloud TTS providers through catalog contracts.
+- `2026-03-23` `199e534` Renamed voice modes to `VoiceWake` and `TalkMode`.
+- `2026-03-23` `47efc3e` Moved voice settings below the node mode toggle.
+- `2026-03-23` `85d7b90` Made cloud TTS voice settings fully catalog-driven.
+- `2026-03-23` `c1cc0ff` Shipped the voice provider catalog with the tray app.
+- `2026-03-23` `83f05ee` Instrumented voice output latency and reduced TTS buffering.
+- `2026-03-23` `d137409` Tightened talk-mode speech-recognition filtering.
+- `2026-03-23` `05d7bae` Switched MiniMax TTS to the `api-uw` endpoint.
+- `2026-03-23` `5efcebf` Added catalog-driven MiniMax WebSocket TTS.
+- `2026-03-23` `45ff8f8` Fixed voice restart after settings save.
+- `2026-03-23` `71d0de4` Fixed MiniMax WebSocket voice playback routing.
+- `2026-03-23` `91ccec3` Added dynamic tray icons for voice states.
+- `2026-03-23` `2ff57fc` Added pre-response voice latency timing logs.
+- `2026-03-23` `ffa3fa2` Kept Talk Mode alive after input failures.
+- `2026-03-25` `c3ded30` Queued Talk Mode replies for sequential playback.
+- `2026-03-25` `82e2958` Added voice control and configuration APIs.
+- `2026-03-25` `06d508f` Accepted late Talk Mode replies after timeout.
From d61d82d0ee28bb1e20c5c45a700568cfb3586266 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 10:41:04 +0000
Subject: [PATCH 35/83] Delay talk mode ready state until recognizer warm-up
---
docs/VOICE-MODE.md | 1 +
src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs | 9 ++++++---
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index c3e71a5..25d5a7d 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -757,3 +757,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` `c3ded30` Queued Talk Mode replies for sequential playback.
- `2026-03-25` `82e2958` Added voice control and configuration APIs.
- `2026-03-25` `06d508f` Accepted late Talk Mode replies after timeout.
+- `2026-03-25` Delayed the Talk Mode ready state until recognizer warm-up completes, so the UI does not advertise listening before the first recognition session has settled.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 7adf99e..f130fcd 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -29,6 +29,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private static readonly TimeSpan TransportConnectTimeout = TimeSpan.FromSeconds(10);
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
private static readonly TimeSpan LateReplyGraceWindow = TimeSpan.FromMinutes(2);
+ private static readonly TimeSpan InitialRecognitionReadyDelay = TimeSpan.FromMilliseconds(500);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
private static readonly TimeSpan RecognitionResumeRetryDelay = TimeSpan.FromMilliseconds(500);
private static readonly TimeSpan QueuedReplyPlaybackGap = TimeSpan.FromMilliseconds(500);
@@ -537,7 +538,8 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
}
await EnsureChatTransportAsync(runtimeCts.Token);
- await StartRecognitionSessionAsync();
+ await StartRecognitionSessionAsync(updateListeningStatus: false);
+ await Task.Delay(InitialRecognitionReadyDelay, runtimeCts.Token);
lock (_gate)
{
@@ -551,6 +553,7 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
}
}
+ _logger.Info($"Speech recognition warm-up completed ({InitialRecognitionReadyDelay.TotalMilliseconds:0}ms)");
_logger.Info("Voice runtime started in mode TalkMode");
}
catch
@@ -688,7 +691,7 @@ private static TaskCompletionSource<bool> GetOrCreateTransportReadySource(
return new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
}
- private async Task StartRecognitionSessionAsync()
+ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = true)
{
SpeechRecognizer? recognizer;
@@ -707,7 +710,7 @@ private async Task StartRecognitionSessionAsync()
lock (_gate)
{
_recognitionActive = true;
- if (_status.Running && !_awaitingReply && !_isSpeaking)
+ if (updateListeningStatus && _status.Running && !_awaitingReply && !_isSpeaking)
{
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
From d1c365567c15a7192f4b8f73ec3da009f7989785 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 10:57:39 +0000
Subject: [PATCH 36/83] Recycle stalled talk mode recognition sessions
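The watchdog relies on a per-session generation counter so that a health check started for an old recognition session cannot recycle a newer one. The sketch below isolates that pattern under simplifying assumptions: the class and member names are hypothetical, the sketch arms the check on every session start (the patch arms it separately via `ArmRecognitionHealthCheck` and in the completion handler), and recycling is reduced to a counter instead of a call to `ResumeRecognitionSessionAsync`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the recognizer state guarded by VoiceService._gate.
public sealed class RecognitionWatchdogSketch
{
    private readonly object _gate = new();
    private readonly TimeSpan _healthCheckDelay;
    private int _generation; // bumped each time a session starts
    private bool _armed;     // cleared when any hypothesis or result arrives
    private bool _active;

    public RecognitionWatchdogSketch(TimeSpan healthCheckDelay) =>
        _healthCheckDelay = healthCheckDelay;

    // Number of times a stalled session would have been recycled.
    public int RecycleCount { get; private set; }

    // Starts a session and returns the watchdog task for that session.
    public Task StartSession(CancellationToken token)
    {
        int generation;
        lock (_gate)
        {
            _active = true;
            _armed = true; // simplification: the patch arms the check separately
            generation = ++_generation;
        }

        return MonitorAsync(generation, token);
    }

    // Called when the recognizer reports a hypothesis or a final result.
    public void OnSpeechHeard()
    {
        lock (_gate)
        {
            _armed = false;
        }
    }

    private async Task MonitorAsync(int generation, CancellationToken token)
    {
        try
        {
            await Task.Delay(_healthCheckDelay, token);
        }
        catch (OperationCanceledException)
        {
            return; // runtime stopped; nothing to recycle
        }

        lock (_gate)
        {
            // Only the watchdog that belongs to the newest session may act,
            // and only if that session has heard nothing since it was armed.
            if (!_armed || !_active || generation != _generation)
            {
                return;
            }

            RecycleCount++; // stand-in for restarting the recognition session
        }
    }
}
```

If `OnSpeechHeard` runs before the delay elapses, or a second `StartSession` call bumps the generation, the earlier watchdog returns without recycling.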
---
docs/VOICE-MODE.md | 1 +
.../Services/Voice/VoiceService.cs | 73 ++++++++++++++++++-
2 files changed, 73 insertions(+), 1 deletion(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 25d5a7d..0ef05d1 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -758,3 +758,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` `82e2958` Added voice control and configuration APIs.
- `2026-03-25` `06d508f` Accepted late Talk Mode replies after timeout.
- `2026-03-25` Delayed the Talk Mode ready state until recognizer warm-up completes, so the UI does not advertise listening before the first recognition session has settled.
+- `2026-03-25` Added a recognizer health-check watchdog so Talk Mode recycles a started-but-deaf recognition session instead of waiting minutes for Windows to cancel it.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index f130fcd..e985777 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -30,6 +30,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
private static readonly TimeSpan LateReplyGraceWindow = TimeSpan.FromMinutes(2);
private static readonly TimeSpan InitialRecognitionReadyDelay = TimeSpan.FromMilliseconds(500);
+ private static readonly TimeSpan RecognitionHealthCheckDelay = TimeSpan.FromSeconds(15);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
private static readonly TimeSpan RecognitionResumeRetryDelay = TimeSpan.FromMilliseconds(500);
private static readonly TimeSpan QueuedReplyPlaybackGap = TimeSpan.FromMilliseconds(500);
@@ -49,6 +50,8 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private SpeechSynthesizer? _speechSynthesizer;
private MediaPlayer? _mediaPlayer;
private bool _recognitionActive;
+ private int _recognitionSessionGeneration;
+ private bool _recognitionHealthCheckArmed;
private bool _awaitingReply;
private bool _isSpeaking;
private bool _replyPlaybackLoopActive;
@@ -539,6 +542,7 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
await EnsureChatTransportAsync(runtimeCts.Token);
await StartRecognitionSessionAsync(updateListeningStatus: false);
+ ArmRecognitionHealthCheck();
await Task.Delay(InitialRecognitionReadyDelay, runtimeCts.Token);
lock (_gate)
@@ -694,14 +698,19 @@ private static TaskCompletionSource<bool> GetOrCreateTransportReadySource(
private async Task StartRecognitionSessionAsync(bool updateListeningStatus = true)
{
SpeechRecognizer? recognizer;
+ CancellationToken runtimeToken;
+ int generation;
lock (_gate)
{
recognizer = _speechRecognizer;
- if (recognizer == null || _recognitionActive)
+ if (recognizer == null || _recognitionActive || _runtimeCts == null)
{
return;
}
+
+ runtimeToken = _runtimeCts.Token;
+ generation = ++_recognitionSessionGeneration;
}
_logger.Info("Starting speech recognition session");
@@ -721,6 +730,7 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
}
_logger.Info("Speech recognition session started");
+ _ = MonitorRecognitionSessionHealthAsync(generation, runtimeToken);
}
private async Task ResumeRecognitionSessionAsync(
@@ -885,6 +895,7 @@ private void OnSpeechHypothesisGenerated(SpeechRecognizer sender, SpeechRecognit
text = args.Hypothesis?.Text?.Trim();
sessionKey = GetCurrentVoiceSessionKey();
+ _recognitionHealthCheckArmed = false;
if (_status.State != VoiceRuntimeState.RecordingUtterance)
{
_status = BuildRunningStatus(
@@ -934,6 +945,7 @@ private async Task HandleRecognizedTextAsync(string text)
_lastTranscript = text;
_lastTranscriptUtc = DateTime.UtcNow;
+ _recognitionHealthCheckArmed = false;
cancellationToken = _runtimeCts.Token;
sessionKey = GetCurrentVoiceSessionKey();
}
@@ -1480,6 +1492,9 @@ private async void OnSpeechRecognitionCompleted(
}
_recognitionActive = false;
+ _recognitionHealthCheckArmed =
+ args.Status == SpeechRecognitionResultStatus.UserCanceled ||
+ args.Status == SpeechRecognitionResultStatus.TimeoutExceeded;
token = _runtimeCts.Token;
shouldRestart = _status.Running &&
_status.Mode == VoiceActivationMode.TalkMode &&
@@ -1729,6 +1744,62 @@ private VoiceStatusInfo BuildRunningStatus(
};
}
+ private void ArmRecognitionHealthCheck()
+ {
+ lock (_gate)
+ {
+ _recognitionHealthCheckArmed = true;
+ }
+ }
+
+ private async Task MonitorRecognitionSessionHealthAsync(int generation, CancellationToken cancellationToken)
+ {
+ try
+ {
+ await Task.Delay(RecognitionHealthCheckDelay, cancellationToken);
+
+ var shouldRecycle = false;
+ lock (_gate)
+ {
+ shouldRecycle =
+ _recognitionHealthCheckArmed &&
+ _recognitionActive &&
+ _runtimeCts != null &&
+ !_runtimeCts.IsCancellationRequested &&
+ _status.Running &&
+ _status.Mode == VoiceActivationMode.TalkMode &&
+ !_awaitingReply &&
+ !_isSpeaking &&
+ generation == _recognitionSessionGeneration;
+
+ if (shouldRecycle)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.Arming,
+ "Speech recognizer stalled; restarting listening.");
+ }
+ }
+
+ if (!shouldRecycle)
+ {
+ return;
+ }
+
+ _logger.Warn(
+ $"Speech recognition session produced no hypotheses/results within {RecognitionHealthCheckDelay.TotalSeconds:0}s; recycling session");
+ await ResumeRecognitionSessionAsync(cancellationToken, "recognition health check");
+ }
+ catch (OperationCanceledException)
+ {
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Speech recognition health check failed: {ex.Message}");
+ }
+ }
+
private VoiceStatusInfo BuildStoppedStatus(string? sessionKey, string? reason)
{
var settings = _settings.Voice;
From d569f71b4f53235a476b16cd1c6bbd8f75144112 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 11:12:58 +0000
Subject: [PATCH 37/83] Revert talk mode to direct chat send
---
docs/VOICE-MODE.md | 27 +-
src/OpenClaw.Shared/VoiceModeSchema.cs | 11 -
src/OpenClaw.Tray.WinUI/App.xaml.cs | 5 -
.../Controls/VoiceSettingsPanel.xaml | 5 -
.../Controls/VoiceSettingsPanel.xaml.cs | 28 +-
.../Services/Voice/VoiceChatContracts.cs | 12 -
.../Services/Voice/VoiceChatCoordinator.cs | 50 ----
.../Services/Voice/VoiceDisplayHelper.cs | 1 -
.../Services/Voice/VoiceService.cs | 91 +-----
.../Windows/VoiceModeWindow.xaml.cs | 8 -
.../Windows/WebChatWindow.xaml.cs | 283 +-----------------
.../VoiceModeSchemaTests.cs | 1 -
.../SettingsRoundTripTests.cs | 4 +-
.../VoiceChatCoordinatorTests.cs | 125 ++------
.../WebChatWindowSecurityTests.cs | 71 -----
15 files changed, 51 insertions(+), 671 deletions(-)
delete mode 100644 tests/OpenClaw.Tray.Tests/WebChatWindowSecurityTests.cs
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 0ef05d1..118a150 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -51,8 +51,8 @@ The contracts and persisted settings now use `VoiceWake` and `TalkMode` as well.
- the node captures audio locally
- local speech recognition turns that audio into transcript text
- interim hypotheses are surfaced live, but only final `Medium` or `High` confidence recognizer results are submitted
-- if the tray chat window is open and ready, the final transcript is submitted through the tray chat window's own compose/send path
-- otherwise, the transcript is sent to OpenClaw via direct `chat.send` on the main session
+- the tray chat window, when open, mirrors the live transcript draft locally
+- the finalized transcript is always sent to OpenClaw via direct `chat.send` on the main session
- OpenClaw returns the assistant reply as normal chat output
- the node performs local TTS playback of that reply
- assistant replies are queued locally and spoken sequentially, with a short 500 ms pause between queued replies so overlapping responses are not lost
@@ -185,12 +185,7 @@ That produced two UX problems:
- chosen
- keeps a single session and single tray chat history
- allows interim hypotheses to appear in the tray compose box in near real time
- - allows the tray app to submit through the same UI path as typed messages when the tray chat window is open
-4. Hybrid submission path
- - chosen
- - when the tray chat window is open, voice submits through the chat window DOM send path
- - when the tray chat window is closed or unavailable, voice falls back to direct `chat.send`
- - preserves windowless voice mode without forcing the transport layer to depend on WebView availability
+ - does not require the voice transport to depend on WebView DOM submission
### Chosen Approach
@@ -201,10 +196,8 @@ The embedded [WebChatWindow.xaml.cs](../src/OpenClaw.Tray.WinUI/Windows/WebChatW
- interim STT hypotheses from Windows speech recognition are injected into the tray chat compose box while the user is speaking
- if the chat window opens during an utterance, the current buffered transcript is copied into the compose box immediately
- if the chat window closes during an utterance, voice continues windowless and the final utterance still submits
-- if the chat window is open and ready when the utterance finalizes, the tray app either auto-submits through the page's own send path or leaves the draft for manual send, depending on `Voice.TalkMode.ChatWindowSubmitMode`
-- in `WaitForUser` mode, voice capture pauses after finalizing the draft so the next utterance does not overwrite the unsent message
-- if the chat window is not open or not ready, the voice service falls back to direct `chat.send`
-- rendered chat content inside the tray window is still sanitized to remove `<relevant-memories>...</relevant-memories>` blocks as a fallback for messages that were sent while windowless
+- the final utterance always goes through the voice service's direct `chat.send` path
+- the tray chat window remains a draft/view surface rather than a transport dependency
This is intentionally a tray-local integration decision, not a protocol-level rewrite of the stored upstream transcript.
@@ -212,9 +205,8 @@ This is intentionally a tray-local integration decision, not a protocol-level re
- preserves a single visible conversation for the user
- avoids a second voice-only session in the tray UI
-- when the tray chat window is open, voice follows the same send path as typed tray-chat messages
-- depends on DOM integration inside the embedded WebView chat surface because OpenClaw does not currently expose a dedicated draft/update or voice-submit API for the tray app
-- still requires a direct fallback path for windowless voice mode
+- uses only one send path for voice turns, which is simpler to reason about and debug
+- keeps a light DOM integration inside the embedded WebView chat surface for draft mirroring only
- only affects the tray app chat window; other clients still render upstream content according to their own rules
## Provider Selection
@@ -461,8 +453,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
"TalkMode": {
"MinSpeechMs": 250,
"EndSilenceMs": 900,
- "MaxUtteranceMs": 15000,
- "ChatWindowSubmitMode": "AutoSend"
+ "MaxUtteranceMs": 15000
}
},
"VoiceProviderConfiguration": {
@@ -527,7 +518,6 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
| `Voice.TalkMode.MinSpeechMs` | int | `250` | talk mode | Minimum detected speech duration before an utterance is treated as real input |
| `Voice.TalkMode.EndSilenceMs` | int | `900` | talk mode | Silence timeout used to finalize an utterance |
| `Voice.TalkMode.MaxUtteranceMs` | int | `15000` | talk mode | Hard cap on utterance length before forced submission/finalization |
-| `Voice.TalkMode.ChatWindowSubmitMode` | enum | `AutoSend` | talk mode | When the tray chat window is open, either auto-send the finalized utterance or leave it in the compose box for manual send |
| `VoiceProviderConfiguration.Providers[].ProviderId` | string | none | cloud providers | Provider id matching an `Assets\\voice-providers.json` entry |
| `VoiceProviderConfiguration.Providers[].Values["apiKey"]` | string? | `null` | cloud providers | API key sent using the provider contract's configured auth header |
| `VoiceProviderConfiguration.Providers[].Values["model"]` | string? | provider default | cloud providers | Model identifier inserted into the configured request template |
@@ -759,3 +749,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` `06d508f` Accepted late Talk Mode replies after timeout.
- `2026-03-25` Delayed the Talk Mode ready state until recognizer warm-up completes, so the UI does not advertise listening before the first recognition session has settled.
- `2026-03-25` Added a recognizer health-check watchdog so Talk Mode recycles a started-but-deaf recognition session instead of waiting minutes for Windows to cancel it.
+- `2026-03-25` Reverted Talk Mode to a single direct `chat.send` path and reduced the tray chat integration back to draft mirroring only.
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index 8b2f542..3fe9d1b 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -52,19 +52,11 @@ public enum VoiceRuntimeState
ListeningContinuously,
RecordingUtterance,
SubmittingAudio,
- PendingManualSend,
AwaitingResponse,
PlayingResponse,
Error
}
-[JsonConverter(typeof(JsonStringEnumConverter<VoiceChatWindowSubmitMode>))]
-public enum VoiceChatWindowSubmitMode
-{
- AutoSend,
- WaitForUser
-}
-
public sealed class VoiceSettings
{
public VoiceActivationMode Mode { get; set; } = VoiceActivationMode.Off;
@@ -96,7 +88,6 @@ public sealed class TalkModeSettings
public int MinSpeechMs { get; set; } = 250;
public int EndSilenceMs { get; set; } = 900;
public int MaxUtteranceMs { get; set; } = 15000;
- public VoiceChatWindowSubmitMode ChatWindowSubmitMode { get; set; } = VoiceChatWindowSubmitMode.AutoSend;
}
public sealed class VoiceAudioDeviceInfo
@@ -310,7 +301,6 @@ public override VoiceRuntimeState Read(ref Utf8JsonReader reader, Type typeToCon
"ListeningContinuously" => VoiceRuntimeState.ListeningContinuously,
"RecordingUtterance" => VoiceRuntimeState.RecordingUtterance,
"SubmittingAudio" => VoiceRuntimeState.SubmittingAudio,
- "PendingManualSend" => VoiceRuntimeState.PendingManualSend,
"AwaitingResponse" => VoiceRuntimeState.AwaitingResponse,
"PlayingResponse" => VoiceRuntimeState.PlayingResponse,
"Error" => VoiceRuntimeState.Error,
@@ -330,7 +320,6 @@ public override void Write(Utf8JsonWriter writer, VoiceRuntimeState value, JsonS
VoiceRuntimeState.ListeningContinuously => "ListeningContinuously",
VoiceRuntimeState.RecordingUtterance => "RecordingUtterance",
VoiceRuntimeState.SubmittingAudio => "SubmittingAudio",
- VoiceRuntimeState.PendingManualSend => "PendingManualSend",
VoiceRuntimeState.AwaitingResponse => "AwaitingResponse",
VoiceRuntimeState.PlayingResponse => "PlayingResponse",
VoiceRuntimeState.Error => "Error",
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index eae2e21..9e90f64 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -259,7 +259,6 @@ protected override async void OnLaunched(LaunchActivatedEventArgs args)
_voiceService = new VoiceService(new AppLogger(), _settings);
_voiceChatCoordinator = new VoiceChatCoordinator(
_voiceService,
- () => _settings.Voice.TalkMode.ChatWindowSubmitMode,
new DispatcherQueueAdapter(_dispatcherQueue!));
_voiceChatCoordinator.ConversationTurnAvailable += OnVoiceConversationTurnAvailable;
@@ -2399,10 +2398,6 @@ private void ExitApplication()
_voiceChatCoordinator.ConversationTurnAvailable -= OnVoiceConversationTurnAvailable;
_voiceChatCoordinator.Dispose();
}
- if (_voiceService != null)
- {
- _voiceService.TranscriptSubmitter = null;
- }
_voiceService?.Dispose();
Exit();
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
index ad5e95d..5277f37 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
@@ -58,11 +58,6 @@
<CheckBox x:Name="VoiceConversationToastsCheckBox"
Content="Show voice transcripts and replies as toasts"/>
- <ComboBox x:Name="VoiceChatWindowSubmitModeComboBox" Header="When the tray chat window is open">
- <ComboBoxItem Content="Send automatically" Tag="AutoSend"/>
- <ComboBoxItem Content="Fill message box and wait for me to send" Tag="WaitForUser"/>
- </ComboBox>
-
<TextBlock x:Name="VoiceSettingsInfoTextBlock"
Style="{StaticResource CaptionTextBlockStyle}"
Foreground="{ThemeResource TextFillColorSecondaryBrush}"
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
index 1d2903d..103b31d 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -66,8 +66,7 @@ public async Task ApplyAsync(SettingsManager settings)
{
MinSpeechMs = settings.Voice.TalkMode.MinSpeechMs,
EndSilenceMs = settings.Voice.TalkMode.EndSilenceMs,
- MaxUtteranceMs = settings.Voice.TalkMode.MaxUtteranceMs,
- ChatWindowSubmitMode = GetSelectedChatWindowSubmitMode()
+ MaxUtteranceMs = settings.Voice.TalkMode.MaxUtteranceMs
}
};
settings.Voice = voiceSettings;
@@ -94,7 +93,6 @@ private void LoadVoiceSettings()
_voiceProviderConfigurationDraft = _settings.VoiceProviderConfiguration.Clone();
LoadVoiceProviders();
SelectVoiceMode(_settings.Voice.Mode);
- SelectChatWindowSubmitMode(_settings.Voice.TalkMode.ChatWindowSubmitMode);
VoiceConversationToastsCheckBox.IsChecked = _settings.Voice.ShowConversationToasts;
UpdateVoiceProviderSettingsEditor();
UpdateVoiceSettingsInfo();
@@ -196,30 +194,6 @@ private VoiceActivationMode GetSelectedVoiceMode()
};
}
- private void SelectChatWindowSubmitMode(VoiceChatWindowSubmitMode mode)
- {
- var target = mode == VoiceChatWindowSubmitMode.WaitForUser ? "WaitForUser" : "AutoSend";
-
- foreach (var item in VoiceChatWindowSubmitModeComboBox.Items.OfType<ComboBoxItem>())
- {
- if (string.Equals(item.Tag?.ToString(), target, StringComparison.Ordinal))
- {
- VoiceChatWindowSubmitModeComboBox.SelectedItem = item;
- return;
- }
- }
-
- VoiceChatWindowSubmitModeComboBox.SelectedIndex = 0;
- }
-
- private VoiceChatWindowSubmitMode GetSelectedChatWindowSubmitMode()
- {
- var tag = (VoiceChatWindowSubmitModeComboBox.SelectedItem as ComboBoxItem)?.Tag?.ToString();
- return tag == "WaitForUser"
- ? VoiceChatWindowSubmitMode.WaitForUser
- : VoiceChatWindowSubmitMode.AutoSend;
- }
-
private void UpdateVoiceSettingsInfo()
{
var stt = (VoiceSpeechToTextProviderComboBox.SelectedItem as VoiceProviderOption)?.Name ?? "Windows Speech Recognition";
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
index fb12212..2450b97 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
@@ -1,6 +1,5 @@
using OpenClaw.Shared;
using System;
-using System.Threading.Tasks;
namespace OpenClawTray.Services.Voice;
@@ -28,8 +27,6 @@ public interface IVoiceRuntime
{
event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
event EventHandler<VoiceTranscriptDraftEventArgs>? TranscriptDraftUpdated;
- Func<string, string?, Task<VoiceTranscriptSubmitOutcome>>? TranscriptSubmitter { get; set; }
- void NotifyManualTranscriptSubmitted(string text, string? sessionKey = null);
}
public interface IVoiceConfigurationApi
@@ -57,14 +54,5 @@ public interface IVoiceRuntimeControlApi
public interface IVoiceChatWindow
{
bool IsClosed { get; }
- event EventHandler<VoiceTranscriptSubmittedEventArgs>? VoiceTranscriptSubmitted;
Task UpdateVoiceTranscriptDraftAsync(string text, bool clear);
- Task<bool> TrySubmitVoiceTranscriptAsync(string text);
- Task<bool> PrepareVoiceTranscriptForManualSendAsync(string text);
-}
-
-public sealed class VoiceTranscriptSubmittedEventArgs : EventArgs
-{
- public string Text { get; set; } = "";
- public string? SessionKey { get; set; }
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
index 39e88ff..1ac8b51 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
@@ -7,7 +7,6 @@ namespace OpenClawTray.Services.Voice;
public sealed class VoiceChatCoordinator : IDisposable
{
private readonly IVoiceRuntime _voiceService;
- private readonly Func<VoiceChatWindowSubmitMode> _getSubmitMode;
private readonly IUiDispatcher _dispatcher;
private readonly object _gate = new();
@@ -19,16 +18,13 @@ public sealed class VoiceChatCoordinator : IDisposable
public VoiceChatCoordinator(
IVoiceRuntime voiceService,
- Func<VoiceChatWindowSubmitMode> getSubmitMode,
IUiDispatcher dispatcher)
{
_voiceService = voiceService;
- _getSubmitMode = getSubmitMode;
_dispatcher = dispatcher;
_voiceService.ConversationTurnAvailable += OnVoiceConversationTurnAvailable;
_voiceService.TranscriptDraftUpdated += OnVoiceTranscriptDraftUpdated;
- _voiceService.TranscriptSubmitter = SubmitVoiceTranscriptAsync;
}
public void AttachWindow(IVoiceChatWindow window)
@@ -42,13 +38,7 @@ public void AttachWindow(IVoiceChatWindow window)
return;
}
- if (_webChatWindow != null)
- {
- _webChatWindow.VoiceTranscriptSubmitted -= OnWebChatVoiceTranscriptSubmitted;
- }
-
_webChatWindow = window;
- _webChatWindow.VoiceTranscriptSubmitted += OnWebChatVoiceTranscriptSubmitted;
}
_ = window.UpdateVoiceTranscriptDraftAsync(
@@ -70,7 +60,6 @@ public void DetachWindow(IVoiceChatWindow? window)
return;
}
- _webChatWindow.VoiceTranscriptSubmitted -= OnWebChatVoiceTranscriptSubmitted;
_webChatWindow = null;
}
}
@@ -86,7 +75,6 @@ public void Dispose()
DetachWindow(null);
_voiceService.ConversationTurnAvailable -= OnVoiceConversationTurnAvailable;
_voiceService.TranscriptDraftUpdated -= OnVoiceTranscriptDraftUpdated;
- _voiceService.TranscriptSubmitter = null;
}
private void OnVoiceConversationTurnAvailable(object? sender, VoiceConversationTurnEventArgs args)
@@ -117,42 +105,4 @@ private void OnVoiceTranscriptDraftUpdated(object? sender, VoiceTranscriptDraftE
_ = window.UpdateVoiceTranscriptDraftAsync(_voiceTranscriptDraftText, args.Clear);
});
}
-
- private void OnWebChatVoiceTranscriptSubmitted(object? sender, VoiceTranscriptSubmittedEventArgs args)
- {
- _voiceTranscriptDraftText = string.Empty;
- _voiceService.NotifyManualTranscriptSubmitted(args.Text, args.SessionKey);
- }
-
- private async Task<VoiceTranscriptSubmitOutcome> SubmitVoiceTranscriptAsync(string text, string? sessionKey)
- {
- try
- {
- IVoiceChatWindow? window;
- lock (_gate)
- {
- window = _webChatWindow;
- }
-
- if (window == null || window.IsClosed)
- {
- return VoiceTranscriptSubmitOutcome.Unavailable;
- }
-
- if (_getSubmitMode() == VoiceChatWindowSubmitMode.WaitForUser)
- {
- return await window.PrepareVoiceTranscriptForManualSendAsync(text)
- ? VoiceTranscriptSubmitOutcome.DeferredToUser
- : VoiceTranscriptSubmitOutcome.Unavailable;
- }
-
- return await window.TrySubmitVoiceTranscriptAsync(text)
- ? VoiceTranscriptSubmitOutcome.Submitted
- : VoiceTranscriptSubmitOutcome.Unavailable;
- }
- catch
- {
- return VoiceTranscriptSubmitOutcome.Unavailable;
- }
- }
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs
index 496c643..a671cf0 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceDisplayHelper.cs
@@ -23,7 +23,6 @@ public static string GetStateLabel(VoiceRuntimeState state)
VoiceRuntimeState.ListeningContinuously => "Listening",
VoiceRuntimeState.RecordingUtterance => "Recording",
VoiceRuntimeState.SubmittingAudio => "Sending",
- VoiceRuntimeState.PendingManualSend => "Waiting for send",
VoiceRuntimeState.AwaitingResponse => "Waiting for reply",
VoiceRuntimeState.PlayingResponse => "Speaking",
VoiceRuntimeState.Paused => "Paused",
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index e985777..e063ce2 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -58,7 +58,6 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private bool _quickPaused;
private string? _lastTranscript;
private DateTime _lastTranscriptUtc;
- private string? _pendingManualTranscript;
private readonly Queue<(string Text, string? SessionKey)> _pendingAssistantReplies = new();
private CancellationTokenSource? _playbackSkipCts;
private string? _currentReplyPreview;
@@ -68,7 +67,6 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
public event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
public event EventHandler<VoiceTranscriptDraftEventArgs>? TranscriptDraftUpdated;
- public Func<string, string?, Task<VoiceTranscriptSubmitOutcome>>? TranscriptSubmitter { get; set; }
public VoiceService(IOpenClawLogger logger, SettingsManager settings)
{
@@ -921,7 +919,6 @@ private async Task HandleRecognizedTextAsync(string text)
var pipelineStopwatch = Stopwatch.StartNew();
long recognitionStopElapsedMs = 0;
long transportReadyElapsedMs = 0;
- long traySubmitElapsedMs = 0;
long directSendElapsedMs = 0;
lock (_gate)
@@ -961,11 +958,9 @@ private async Task HandleRecognizedTextAsync(string text)
transportReadyElapsedMs = pipelineStopwatch.ElapsedMilliseconds - recognitionStopElapsedMs;
OpenClawGatewayClient? client;
- Func<string, string?, Task<VoiceTranscriptSubmitOutcome>>? transcriptSubmitter;
lock (_gate)
{
client = _chatClient;
- transcriptSubmitter = TranscriptSubmitter;
}
if (client == null)
@@ -974,52 +969,18 @@ private async Task HandleRecognizedTextAsync(string text)
}
_logger.Info($"Voice transcript captured: {text}");
- var submitOutcome = VoiceTranscriptSubmitOutcome.Unavailable;
- if (transcriptSubmitter != null)
- {
- var submitStopwatch = Stopwatch.StartNew();
- submitOutcome = await transcriptSubmitter(text, sessionKey);
- traySubmitElapsedMs = submitStopwatch.ElapsedMilliseconds;
- _logger.Info($"Voice tray submit path: outcome={submitOutcome} elapsed={traySubmitElapsedMs}ms");
- }
-
- if (submitOutcome == VoiceTranscriptSubmitOutcome.Unavailable)
- {
- var directSendStopwatch = Stopwatch.StartNew();
- await client.SendChatMessageAsync(text, sessionKey);
- directSendElapsedMs = directSendStopwatch.ElapsedMilliseconds;
- submitOutcome = VoiceTranscriptSubmitOutcome.Submitted;
- _logger.Info($"Voice direct send path: elapsed={directSendElapsedMs}ms");
- }
-
- if (submitOutcome == VoiceTranscriptSubmitOutcome.DeferredToUser)
- {
- _logger.Info(
- $"Voice pre-response latency: recognitionStop={recognitionStopElapsedMs}ms transportReady={transportReadyElapsedMs}ms traySubmit={traySubmitElapsedMs}ms total={pipelineStopwatch.ElapsedMilliseconds}ms (deferred to user)");
- lock (_gate)
- {
- _awaitingReply = false;
- _pendingManualTranscript = text;
- _status = BuildRunningStatus(
- VoiceActivationMode.TalkMode,
- _status.SessionKey,
- VoiceRuntimeState.PendingManualSend,
- "Draft ready in tray chat window. Send it manually to continue.");
- _status.LastUtteranceUtc = DateTime.UtcNow;
- }
-
- RaiseTranscriptDraft(text, sessionKey, clear: false);
- return;
- }
+ var directSendStopwatch = Stopwatch.StartNew();
+ await client.SendChatMessageAsync(text, sessionKey);
+ directSendElapsedMs = directSendStopwatch.ElapsedMilliseconds;
+ _logger.Info($"Voice direct send path: elapsed={directSendElapsedMs}ms");
_logger.Info(
- $"Voice pre-response latency: recognitionStop={recognitionStopElapsedMs}ms transportReady={transportReadyElapsedMs}ms traySubmit={traySubmitElapsedMs}ms directSend={directSendElapsedMs}ms total={pipelineStopwatch.ElapsedMilliseconds}ms");
+ $"Voice pre-response latency: recognitionStop={recognitionStopElapsedMs}ms transportReady={transportReadyElapsedMs}ms directSend={directSendElapsedMs}ms total={pipelineStopwatch.ElapsedMilliseconds}ms");
lock (_gate)
{
_awaitingReply = true;
_lateReplySessionKey = null;
_lateReplyGraceUntilUtc = null;
- _pendingManualTranscript = null;
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
@@ -1052,38 +1013,6 @@ private async Task HandleRecognizedTextAsync(string text)
}
}
- public void NotifyManualTranscriptSubmitted(string text, string? sessionKey = null)
- {
- CancellationToken cancellationToken;
- string effectiveSessionKey;
-
- lock (_gate)
- {
- if (_runtimeCts == null || _status.Mode != VoiceActivationMode.TalkMode || !_status.Running)
- {
- return;
- }
-
- cancellationToken = _runtimeCts.Token;
- effectiveSessionKey = string.IsNullOrWhiteSpace(sessionKey) ? GetCurrentVoiceSessionKey() : sessionKey!;
- _pendingManualTranscript = null;
- _awaitingReply = true;
- _lateReplySessionKey = null;
- _lateReplyGraceUntilUtc = null;
- _status = BuildRunningStatus(
- VoiceActivationMode.TalkMode,
- _status.SessionKey,
- VoiceRuntimeState.AwaitingResponse,
- _status.LastError);
- _status.LastUtteranceUtc = DateTime.UtcNow;
- }
-
- _logger.Info("Voice response wait started (manual submit)");
- RaiseConversationTurn(VoiceConversationDirection.Outgoing, text, effectiveSessionKey);
- RaiseTranscriptDraft(string.Empty, effectiveSessionKey, clear: true);
- _ = MonitorReplyTimeoutAsync(text, cancellationToken);
- }
-
private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken cancellationToken)
{
try
@@ -1602,7 +1531,6 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_awaitingReply = false;
_isSpeaking = false;
_replyPlaybackLoopActive = false;
- _pendingManualTranscript = null;
_pendingAssistantReplies.Clear();
_currentReplyPreview = null;
_lateReplySessionKey = null;
@@ -1880,8 +1808,7 @@ private static VoiceSettings Clone(VoiceSettings source)
{
MinSpeechMs = source.TalkMode.MinSpeechMs,
EndSilenceMs = source.TalkMode.EndSilenceMs,
- MaxUtteranceMs = source.TalkMode.MaxUtteranceMs,
- ChatWindowSubmitMode = source.TalkMode.ChatWindowSubmitMode
+ MaxUtteranceMs = source.TalkMode.MaxUtteranceMs
}
};
}
@@ -2003,9 +1930,3 @@ public sealed class VoiceTranscriptDraftEventArgs : EventArgs
public VoiceActivationMode Mode { get; set; } = VoiceActivationMode.Off;
}
-public enum VoiceTranscriptSubmitOutcome
-{
- Unavailable,
- Submitted,
- DeferredToUser
-}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
index cac4543..01694ae 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceModeWindow.xaml.cs
@@ -62,7 +62,6 @@ public void RefreshStatus()
new("Text to speech", ResolveProviderName(catalog.TextToSpeechProviders, _settings.Voice.TextToSpeechProviderId, "Windows Speech Synthesis")),
new("Listen device", DescribeDevice(_settings.Voice.InputDeviceId, "System default microphone")),
new("Talk device", DescribeDevice(_settings.Voice.OutputDeviceId, "System default speaker")),
- new("Chat window", DescribeChatWindowSubmitMode(_settings.Voice.TalkMode.ChatWindowSubmitMode)),
new("Voice toasts", _settings.Voice.ShowConversationToasts ? "Enabled" : "Disabled")
};
@@ -97,13 +96,6 @@ private static string DescribeDevice(string? deviceId, string defaultLabel)
return string.IsNullOrWhiteSpace(deviceId) ? defaultLabel : "Selected device";
}
- private static string DescribeChatWindowSubmitMode(VoiceChatWindowSubmitMode mode)
- {
- return mode == VoiceChatWindowSubmitMode.WaitForUser
- ? "Fill message box and wait for send"
- : "Send automatically";
- }
-
private static string FormatTimestamp(DateTime? value)
{
return value?.ToLocalTime().ToString("HH:mm:ss") ?? "None";
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index e772d62..c660697 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -18,40 +18,24 @@ namespace OpenClawTray.Windows;
public sealed partial class WebChatWindow : WindowEx
, IVoiceChatWindow
{
- private const string VoiceManualSubmitMessageType = "voice-manual-submit";
private readonly string _gatewayUrl;
private readonly string _token;
- private readonly string _voiceMessageNonce = Guid.NewGuid().ToString("N");
private string _pendingVoiceDraft = string.Empty;
- private string? _trustedVoiceMessageOrigin;
// Store event handlers for cleanup
private TypedEventHandler<CoreWebView2, CoreWebView2NavigationCompletedEventArgs>? _navigationCompletedHandler;
private TypedEventHandler<CoreWebView2, CoreWebView2NavigationStartingEventArgs>? _navigationStartingHandler;
- private TypedEventHandler<CoreWebView2, CoreWebView2WebMessageReceivedEventArgs>? _webMessageReceivedHandler;
public bool IsClosed { get; private set; }
- public event EventHandler<VoiceTranscriptSubmittedEventArgs>? VoiceTranscriptSubmitted;
- private const string TrayVoiceIntegrationScriptTemplate = """
+ private const string TrayVoiceIntegrationScript = """
(() => {
- const submitNonce = __VOICE_NONCE__;
- const memoryPattern = /<relevant-memories>[\s\S]*?<\/relevant-memories>\s*/gi;
- const sanitize = (value) => typeof value === 'string' ? value.replace(memoryPattern, '').trimStart() : value;
const isVisible = (el) => !!el && !(el.disabled === true) && el.getClientRects().length > 0;
- const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
let desiredDraft = '';
const findComposer = () => {
const candidates = Array.from(document.querySelectorAll('textarea, input[type="text"], [contenteditable="true"], [contenteditable="plaintext-only"]'));
return candidates.find(isVisible) || null;
};
- const getComposerValue = (composer) => sanitize(('value' in composer ? composer.value : composer.textContent) || '');
- const isSendLike = (button) => {
- if (!button || !isVisible(button)) return false;
- if (button.disabled === true || button.getAttribute('aria-disabled') === 'true') return false;
- const text = ((button.innerText || button.textContent || '') + ' ' + (button.getAttribute('aria-label') || '')).trim().toLowerCase();
- return text === 'send' || text.startsWith('send ') || text.includes('send ↵') || text.includes('send');
- };
const setElementValue = (el, value) => {
if ('value' in el) {
const proto = el.tagName === 'TEXTAREA' ? HTMLTextAreaElement.prototype : HTMLInputElement.prototype;
@@ -77,117 +61,11 @@ public sealed partial class WebChatWindow : WindowEx
setElementValue(composer, desiredDraft);
return true;
};
- const findSendButton = () => {
- const buttons = Array.from(document.querySelectorAll('button, [role="button"], input[type="submit"]'));
- return buttons.find(isSendLike) || null;
- };
- const waitForDraftToLeaveComposer = async (expectedText) => {
- for (let i = 0; i < 20; i++) {
- await delay(75);
- const composer = findComposer();
- if (!composer) return true;
- const current = getComposerValue(composer);
- if (!current || current !== sanitize(expectedText || '')) {
- return true;
- }
- }
- return false;
- };
- const submitDraft = async (text) => {
- desiredDraft = sanitize(text || '');
- pendingManual = false;
- const composer = findComposer();
- if (!composer) return false;
- setElementValue(composer, desiredDraft);
- await delay(0);
- const sendButton = findSendButton();
- if (sendButton) {
- sendButton.click();
- const sent = await waitForDraftToLeaveComposer(desiredDraft);
- if (sent) {
- desiredDraft = '';
- }
- return sent;
- }
- composer.dispatchEvent(new KeyboardEvent('keydown', { key: 'Enter', code: 'Enter', which: 13, keyCode: 13, bubbles: true }));
- composer.dispatchEvent(new KeyboardEvent('keypress', { key: 'Enter', code: 'Enter', which: 13, keyCode: 13, bubbles: true }));
- composer.dispatchEvent(new KeyboardEvent('keyup', { key: 'Enter', code: 'Enter', which: 13, keyCode: 13, bubbles: true }));
- const sent = await waitForDraftToLeaveComposer(desiredDraft);
- if (sent) {
- desiredDraft = '';
- }
- return sent;
- };
- let pendingManual = false;
- const emitManualSubmit = () => {
- if (!pendingManual) return;
- const composer = findComposer();
- if (!composer) return;
- const current = sanitize(('value' in composer ? composer.value : composer.textContent) || '');
- if (!current) return;
- pendingManual = false;
- desiredDraft = '';
- if (window.chrome?.webview?.postMessage) {
- window.chrome.webview.postMessage(JSON.stringify({ type: 'voice-manual-submit', text: current, nonce: submitNonce }));
- }
- };
- const cleanTextNodes = () => {
- if (!document.body) return;
- const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
- const nodes = [];
- let current;
- while ((current = walker.nextNode())) {
- nodes.push(current);
- }
- for (const node of nodes) {
- if (!node || !node.parentElement) continue;
- const tag = node.parentElement.tagName;
- if (tag === 'SCRIPT' || tag === 'STYLE' || tag === 'TEXTAREA') continue;
- const original = node.textContent || '';
- const cleaned = sanitize(original);
- if (cleaned !== original) {
- node.textContent = cleaned;
- }
- }
- };
- let cleanScheduled = false;
- const scheduleClean = () => {
- if (cleanScheduled) return;
- cleanScheduled = true;
- queueMicrotask(() => {
- cleanScheduled = false;
- cleanTextNodes();
- applyDraftIfPossible();
- });
- };
- const observer = new MutationObserver(() => scheduleClean());
+ const observer = new MutationObserver(() => applyDraftIfPossible());
const start = () => {
if (!document.body) return;
- observer.observe(document.body, { childList: true, subtree: true, characterData: true });
- document.addEventListener('click', (event) => {
- const target = event.target instanceof Element ? event.target.closest('button, [role="button"], input[type="submit"]') : null;
- if (!target) return;
- if (isSendLike(target)) {
- emitManualSubmit();
- }
- }, true);
- document.addEventListener('submit', (event) => {
- const form = event.target instanceof HTMLFormElement ? event.target : null;
- const composer = findComposer();
- if (!form || !composer) return;
- if (form.contains(composer)) {
- emitManualSubmit();
- }
- }, true);
- document.addEventListener('keydown', (event) => {
- if (event.key !== 'Enter' || event.shiftKey) return;
- const composer = findComposer();
- if (!composer) return;
- if (event.target === composer) {
- emitManualSubmit();
- }
- }, true);
- scheduleClean();
+ observer.observe(document.body, { childList: true, subtree: true });
+ applyDraftIfPossible();
};
if (document.readyState === 'loading') {
document.addEventListener('DOMContentLoaded', start, { once: true });
@@ -196,24 +74,11 @@ public sealed partial class WebChatWindow : WindowEx
}
window.__openClawTrayVoice = {
setDraft(text) {
- desiredDraft = sanitize(text || '');
- return applyDraftIfPossible();
- },
- prepareManualDraft(text) {
- desiredDraft = sanitize(text || '');
- pendingManual = true;
+ desiredDraft = text || '';
return applyDraftIfPossible();
},
clearDraft() {
desiredDraft = '';
- pendingManual = false;
- return applyDraftIfPossible();
- },
- submitDraft(text) {
- return submitDraft(text);
- },
- stripInjectedMemories() {
- scheduleClean();
return true;
}
};
@@ -252,8 +117,6 @@ private void OnWindowClosed(object sender, WindowEventArgs e)
WebView.CoreWebView2.NavigationCompleted -= _navigationCompletedHandler;
if (_navigationStartingHandler != null)
WebView.CoreWebView2.NavigationStarting -= _navigationStartingHandler;
- if (_webMessageReceivedHandler != null)
- WebView.CoreWebView2.WebMessageReceived -= _webMessageReceivedHandler;
}
}
@@ -282,8 +145,7 @@ private async Task InitializeWebViewAsync()
WebView.CoreWebView2.Settings.IsStatusBarEnabled = false;
WebView.CoreWebView2.Settings.AreDefaultContextMenusEnabled = true;
WebView.CoreWebView2.Settings.IsZoomControlEnabled = true;
- await WebView.CoreWebView2.AddScriptToExecuteOnDocumentCreatedAsync(
- BuildTrayVoiceIntegrationScript(_voiceMessageNonce));
+ await WebView.CoreWebView2.AddScriptToExecuteOnDocumentCreatedAsync(TrayVoiceIntegrationScript);
// Handle navigation events (store for cleanup)
_navigationCompletedHandler = (s, e) =>
@@ -324,33 +186,6 @@ await WebView.CoreWebView2.AddScriptToExecuteOnDocumentCreatedAsync(
};
WebView.CoreWebView2.NavigationStarting += _navigationStartingHandler;
- _webMessageReceivedHandler = (s, e) =>
- {
- try
- {
- if (!TryExtractTrustedVoiceManualSubmit(
- e.TryGetWebMessageAsString(),
- e.Source,
- _trustedVoiceMessageOrigin,
- _voiceMessageNonce,
- out var text))
- {
- return;
- }
-
- VoiceTranscriptSubmitted?.Invoke(this, new VoiceTranscriptSubmittedEventArgs
- {
- Text = text,
- SessionKey = null
- });
- }
- catch (Exception ex)
- {
- Logger.Warn($"WebChatWindow: Failed to process voice web message: {ex.Message}");
- }
- };
- WebView.CoreWebView2.WebMessageReceived += _webMessageReceivedHandler;
-
// Navigate to chat
NavigateToChat();
}
@@ -445,7 +280,6 @@ private void NavigateToChat()
if (!string.IsNullOrEmpty(DEBUG_TEST_URL))
{
Logger.Info($"WebChatWindow: DEBUG MODE - Navigating to test URL: {DEBUG_TEST_URL}");
- _trustedVoiceMessageOrigin = TryGetOrigin(DEBUG_TEST_URL);
WebView.CoreWebView2.Navigate(DEBUG_TEST_URL);
return;
}
@@ -459,67 +293,9 @@ private void NavigateToChat()
var safeBaseUrl = url.Split('?')[0];
Logger.Info($"WebChatWindow: Navigating to {safeBaseUrl} (token hidden)");
- _trustedVoiceMessageOrigin = TryGetOrigin(url);
WebView.CoreWebView2.Navigate(url);
}
- private static string BuildTrayVoiceIntegrationScript(string nonce)
- {
- return TrayVoiceIntegrationScriptTemplate.Replace(
- "__VOICE_NONCE__",
- JsonSerializer.Serialize(nonce),
- StringComparison.Ordinal);
- }
-
- private static bool TryExtractTrustedVoiceManualSubmit(
- string payload,
- string? source,
- string? expectedOrigin,
- string expectedNonce,
- out string text)
- {
- text = string.Empty;
-
- if (!IsTrustedVoiceMessageSource(source, expectedOrigin))
- {
- return false;
- }
-
- using var doc = JsonDocument.Parse(payload);
- if (!doc.RootElement.TryGetProperty("type", out var typeProp) ||
- !string.Equals(typeProp.GetString(), VoiceManualSubmitMessageType, StringComparison.Ordinal))
- {
- return false;
- }
-
- if (!doc.RootElement.TryGetProperty("nonce", out var nonceProp) ||
- !string.Equals(nonceProp.GetString(), expectedNonce, StringComparison.Ordinal))
- {
- return false;
- }
-
- text = doc.RootElement.TryGetProperty("text", out var textProp)
- ? textProp.GetString() ?? string.Empty
- : string.Empty;
-
- return !string.IsNullOrWhiteSpace(text);
- }
-
- private static bool IsTrustedVoiceMessageSource(string? source, string? expectedOrigin)
- {
- var actualOrigin = TryGetOrigin(source);
- return !string.IsNullOrWhiteSpace(expectedOrigin) &&
- !string.IsNullOrWhiteSpace(actualOrigin) &&
- string.Equals(actualOrigin, expectedOrigin, StringComparison.OrdinalIgnoreCase);
- }
-
- private static string? TryGetOrigin(string? url)
- {
- return Uri.TryCreate(url, UriKind.Absolute, out var uri)
- ? uri.GetLeftPart(UriPartial.Authority)
- : null;
- }
-
private void OnHome(object sender, RoutedEventArgs e)
{
NavigateToChat();
@@ -560,51 +336,6 @@ public async Task UpdateVoiceTranscriptDraftAsync(string text, bool clear)
await RefreshTrayVoiceDomStateAsync();
}
- public async Task<bool> TrySubmitVoiceTranscriptAsync(string text)
- {
- if (WebView.CoreWebView2 == null)
- {
- return false;
- }
-
- try
- {
- var stopwatch = Stopwatch.StartNew();
- var textJson = JsonSerializer.Serialize(text ?? string.Empty);
- var result = await WebView.CoreWebView2.ExecuteScriptAsync(
- $"window.__openClawTrayVoice?.submitDraft?.({textJson}) ?? false;");
- var submitted = string.Equals(result, "true", StringComparison.OrdinalIgnoreCase);
- Logger.Info($"WebChatWindow: Voice draft submit via chat UI {(submitted ? "succeeded" : "failed")} in {stopwatch.ElapsedMilliseconds}ms");
- return submitted;
- }
- catch (Exception ex)
- {
- Logger.Warn($"WebChatWindow: Failed to submit voice draft through chat UI: {ex.Message}");
- return false;
- }
- }
-
- public async Task<bool> PrepareVoiceTranscriptForManualSendAsync(string text)
- {
- if (WebView.CoreWebView2 == null)
- {
- return false;
- }
-
- try
- {
- var textJson = JsonSerializer.Serialize(text ?? string.Empty);
- var result = await WebView.CoreWebView2.ExecuteScriptAsync(
- $"window.__openClawTrayVoice?.prepareManualDraft?.({textJson}) ?? false;");
- return string.Equals(result, "true", StringComparison.OrdinalIgnoreCase);
- }
- catch (Exception ex)
- {
- Logger.Warn($"WebChatWindow: Failed to prepare manual voice draft: {ex.Message}");
- return false;
- }
- }
-
private async Task RefreshTrayVoiceDomStateAsync()
{
if (WebView.CoreWebView2 == null)
@@ -614,8 +345,6 @@ private async Task RefreshTrayVoiceDomStateAsync()
try
{
- await WebView.CoreWebView2.ExecuteScriptAsync("window.__openClawTrayVoice?.stripInjectedMemories?.();");
-
var draftJson = JsonSerializer.Serialize(_pendingVoiceDraft ?? string.Empty);
var script = string.IsNullOrWhiteSpace(_pendingVoiceDraft)
? "window.__openClawTrayVoice?.clearDraft?.();"
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index d0edc0d..0c58a64 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -43,7 +43,6 @@ public void VoiceSettings_Defaults_AreConcreteAndProviderAgnostic()
Assert.Equal("hey_openclaw", settings.VoiceWake.ModelId);
Assert.Equal(0.65f, settings.VoiceWake.TriggerThreshold);
Assert.Equal(250, settings.TalkMode.MinSpeechMs);
- Assert.Equal(VoiceChatWindowSubmitMode.AutoSend, settings.TalkMode.ChatWindowSubmitMode);
}
[Fact]
diff --git a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
index 1edf191..d52efc0 100644
--- a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
+++ b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
@@ -54,8 +54,7 @@ public void RoundTrip_AllFields_Preserved()
{
MinSpeechMs = 300,
EndSilenceMs = 1100,
- MaxUtteranceMs = 18000,
- ChatWindowSubmitMode = VoiceChatWindowSubmitMode.WaitForUser
+ MaxUtteranceMs = 18000
}
},
VoiceProviderConfiguration = new VoiceProviderConfigurationStore
@@ -125,7 +124,6 @@ public void RoundTrip_AllFields_Preserved()
Assert.Equal("hey_openclaw", restored.Voice.VoiceWake.ModelId);
Assert.Equal(0.72f, restored.Voice.VoiceWake.TriggerThreshold);
Assert.Equal(300, restored.Voice.TalkMode.MinSpeechMs);
- Assert.Equal(VoiceChatWindowSubmitMode.WaitForUser, restored.Voice.TalkMode.ChatWindowSubmitMode);
Assert.NotNull(restored.VoiceProviderConfiguration);
Assert.Equal("minimax-key", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.ApiKey));
Assert.Equal("speech-2.8-turbo", restored.VoiceProviderConfiguration.GetValue(VoiceProviderIds.MiniMax, VoiceProviderSettingKeys.Model));
diff --git a/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
index 62151a7..484a9bc 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
@@ -9,7 +9,7 @@ public class VoiceChatCoordinatorTests
public async Task AttachWindow_ReplaysBufferedDraft()
{
var runtime = new FakeVoiceRuntime();
- using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.AutoSend, new ImmediateDispatcher());
+ using var coordinator = new VoiceChatCoordinator(runtime, new ImmediateDispatcher());
runtime.RaiseDraft("hello world", "main", clear: false);
@@ -22,91 +22,60 @@ public async Task AttachWindow_ReplaysBufferedDraft()
}
[Fact]
- public async Task Submitter_AutoSend_UsesChatWindowSubmit()
+ public async Task DraftClear_IsReplayedWhenWindowAttachesLater()
{
var runtime = new FakeVoiceRuntime();
- using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.AutoSend, new ImmediateDispatcher());
- var window = new FakeVoiceChatWindow { SubmitResult = true };
- coordinator.AttachWindow(window);
-
- var result = await runtime.TranscriptSubmitter!("send this", "main");
+ using var coordinator = new VoiceChatCoordinator(runtime, new ImmediateDispatcher());
- Assert.Equal(VoiceTranscriptSubmitOutcome.Submitted, result);
- Assert.Equal(1, window.TrySubmitCallCount);
- Assert.Equal(0, window.PrepareCallCount);
- Assert.Equal("send this", window.LastSubmittedText);
- }
+ runtime.RaiseDraft("temporary draft", "main", clear: false);
+ runtime.RaiseDraft(string.Empty, "main", clear: true);
+ await Task.Yield();
- [Fact]
- public async Task Submitter_WaitForUser_PreparesDraftInsteadOfSending()
- {
- var runtime = new FakeVoiceRuntime();
- using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.WaitForUser, new ImmediateDispatcher());
- var window = new FakeVoiceChatWindow { PrepareResult = true };
+ var window = new FakeVoiceChatWindow();
coordinator.AttachWindow(window);
+ await Task.Yield();
- var result = await runtime.TranscriptSubmitter!("draft only", "main");
-
- Assert.Equal(VoiceTranscriptSubmitOutcome.DeferredToUser, result);
- Assert.Equal(0, window.TrySubmitCallCount);
- Assert.Equal(1, window.PrepareCallCount);
- Assert.Equal("draft only", window.LastPreparedText);
- }
-
- [Fact]
- public async Task Submitter_WithoutWindow_FallsBackAsUnavailable()
- {
- var runtime = new FakeVoiceRuntime();
- using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.AutoSend, new ImmediateDispatcher());
-
- var result = await runtime.TranscriptSubmitter!("headless", "main");
-
- Assert.Equal(VoiceTranscriptSubmitOutcome.Unavailable, result);
+ Assert.Equal(string.Empty, window.LastDraftText);
+ Assert.True(window.LastDraftClear);
}
[Fact]
- public async Task ManualSubmit_NotifiesRuntime_AndClearsBufferedDraft()
+ public async Task DraftUpdates_AreIgnoredForClosedWindow()
{
var runtime = new FakeVoiceRuntime();
- using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.WaitForUser, new ImmediateDispatcher());
- var firstWindow = new FakeVoiceChatWindow();
- coordinator.AttachWindow(firstWindow);
-
- runtime.RaiseDraft("working draft", "main", clear: false);
- await Task.Yield();
-
- firstWindow.RaiseSubmitted("final text", "main");
-
- Assert.Equal("final text", runtime.LastManualSubmitText);
- Assert.Equal("main", runtime.LastManualSubmitSessionKey);
+ using var coordinator = new VoiceChatCoordinator(runtime, new ImmediateDispatcher());
+ var window = new FakeVoiceChatWindow { IsClosed = true };
+ coordinator.AttachWindow(window);
+ var updateCountAfterAttach = window.UpdateCallCount;
- var secondWindow = new FakeVoiceChatWindow();
- coordinator.AttachWindow(secondWindow);
+ runtime.RaiseDraft("headless text", "main", clear: false);
await Task.Yield();
- Assert.Equal(string.Empty, secondWindow.LastDraftText);
- Assert.True(secondWindow.LastDraftClear);
+ Assert.Equal(updateCountAfterAttach, window.UpdateCallCount);
}
[Fact]
- public void ManualSubmit_AllowsRuntimeToUseCurrentSession_WhenWindowDoesNotSpecifyOne()
+ public async Task DetachWindow_StopsFurtherDraftMirroring()
{
var runtime = new FakeVoiceRuntime();
- using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.WaitForUser, new ImmediateDispatcher());
+ using var coordinator = new VoiceChatCoordinator(runtime, new ImmediateDispatcher());
var window = new FakeVoiceChatWindow();
coordinator.AttachWindow(window);
- window.RaiseSubmitted("follow up", null);
+ coordinator.DetachWindow(window);
+ runtime.RaiseDraft("after detach", "main", clear: false);
+ await Task.Yield();
- Assert.Equal("follow up", runtime.LastManualSubmitText);
- Assert.Null(runtime.LastManualSubmitSessionKey);
+ Assert.Equal(1, window.UpdateCallCount);
+ Assert.Equal(string.Empty, window.LastDraftText);
+ Assert.True(window.LastDraftClear);
}
[Fact]
public void ConversationTurn_IsForwarded()
{
var runtime = new FakeVoiceRuntime();
- using var coordinator = new VoiceChatCoordinator(runtime, () => VoiceChatWindowSubmitMode.AutoSend, new ImmediateDispatcher());
+ using var coordinator = new VoiceChatCoordinator(runtime, new ImmediateDispatcher());
VoiceConversationTurnEventArgs? received = null;
coordinator.ConversationTurnAvailable += (_, args) => received = args;
@@ -135,16 +104,6 @@ private sealed class FakeVoiceRuntime : IVoiceRuntime
{
public event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
public event EventHandler<VoiceTranscriptDraftEventArgs>? TranscriptDraftUpdated;
- public Func<string, string?, Task<VoiceTranscriptSubmitOutcome>>? TranscriptSubmitter { get; set; }
-
- public string? LastManualSubmitText { get; private set; }
- public string? LastManualSubmitSessionKey { get; private set; }
-
- public void NotifyManualTranscriptSubmitted(string text, string? sessionKey = null)
- {
- LastManualSubmitText = text;
- LastManualSubmitSessionKey = sessionKey;
- }
public void RaiseDraft(string text, string? sessionKey, bool clear)
{
@@ -165,45 +124,17 @@ public void RaiseConversationTurn(VoiceConversationTurnEventArgs args)
private sealed class FakeVoiceChatWindow : IVoiceChatWindow
{
public bool IsClosed { get; set; }
- public event EventHandler<VoiceTranscriptSubmittedEventArgs>? VoiceTranscriptSubmitted;
public string LastDraftText { get; private set; } = string.Empty;
public bool LastDraftClear { get; private set; }
- public string? LastSubmittedText { get; private set; }
- public string? LastPreparedText { get; private set; }
- public int TrySubmitCallCount { get; private set; }
- public int PrepareCallCount { get; private set; }
- public bool SubmitResult { get; set; }
- public bool PrepareResult { get; set; }
+ public int UpdateCallCount { get; private set; }
public Task UpdateVoiceTranscriptDraftAsync(string text, bool clear)
{
+ UpdateCallCount++;
LastDraftText = text;
LastDraftClear = clear;
return Task.CompletedTask;
}
-
- public Task<bool> TrySubmitVoiceTranscriptAsync(string text)
- {
- TrySubmitCallCount++;
- LastSubmittedText = text;
- return Task.FromResult(SubmitResult);
- }
-
- public Task<bool> PrepareVoiceTranscriptForManualSendAsync(string text)
- {
- PrepareCallCount++;
- LastPreparedText = text;
- return Task.FromResult(PrepareResult);
- }
-
- public void RaiseSubmitted(string text, string? sessionKey)
- {
- VoiceTranscriptSubmitted?.Invoke(this, new VoiceTranscriptSubmittedEventArgs
- {
- Text = text,
- SessionKey = sessionKey
- });
- }
}
}
diff --git a/tests/OpenClaw.Tray.Tests/WebChatWindowSecurityTests.cs b/tests/OpenClaw.Tray.Tests/WebChatWindowSecurityTests.cs
deleted file mode 100644
index ea07286..0000000
--- a/tests/OpenClaw.Tray.Tests/WebChatWindowSecurityTests.cs
+++ /dev/null
@@ -1,71 +0,0 @@
-using System.Reflection;
-using OpenClawTray.Windows;
-
-namespace OpenClaw.Tray.Tests;
-
-public class WebChatWindowSecurityTests
-{
- [Fact]
- public void TrustedVoiceSubmit_AllowsExpectedOriginAndNonce()
- {
- var method = GetTrustedSubmitMethod();
- var arguments = new object?[]
- {
- """{"type":"voice-manual-submit","text":"hello world","nonce":"expected-nonce"}""",
- "https://chat.example.test/path?x=1",
- "https://chat.example.test",
- "expected-nonce",
- null
- };
-
- var accepted = (bool)method.Invoke(null, arguments)!;
-
- Assert.True(accepted);
- Assert.Equal("hello world", arguments[4]);
- }
-
- [Fact]
- public void TrustedVoiceSubmit_RejectsUnexpectedOrigin()
- {
- var method = GetTrustedSubmitMethod();
- var arguments = new object?[]
- {
- """{"type":"voice-manual-submit","text":"hello world","nonce":"expected-nonce"}""",
- "https://evil.example.test/",
- "https://chat.example.test",
- "expected-nonce",
- null
- };
-
- var accepted = (bool)method.Invoke(null, arguments)!;
-
- Assert.False(accepted);
- Assert.Equal(string.Empty, arguments[4]);
- }
-
- [Fact]
- public void TrustedVoiceSubmit_RejectsUnexpectedNonce()
- {
- var method = GetTrustedSubmitMethod();
- var arguments = new object?[]
- {
- """{"type":"voice-manual-submit","text":"hello world","nonce":"wrong-nonce"}""",
- "https://chat.example.test/",
- "https://chat.example.test",
- "expected-nonce",
- null
- };
-
- var accepted = (bool)method.Invoke(null, arguments)!;
-
- Assert.False(accepted);
- Assert.Equal(string.Empty, arguments[4]);
- }
-
- private static MethodInfo GetTrustedSubmitMethod()
- {
- return typeof(WebChatWindow).GetMethod(
- "TryExtractTrustedVoiceManualSubmit",
- BindingFlags.NonPublic | BindingFlags.Static)!;
- }
-}
From 7b8f118c1cc9cd5342d3dd8f4d86b1bd74fb969a Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 15:52:45 +0000
Subject: [PATCH 38/83] Add configurable tray chat memory stripping
---
docs/VOICE-MODE.md | 3 +
src/OpenClaw.Shared/VoiceModeSchema.cs | 1 +
src/OpenClaw.Tray.WinUI/App.xaml.cs | 25 ++++++--
.../Controls/VoiceSettingsPanel.xaml | 2 +
.../Controls/VoiceSettingsPanel.xaml.cs | 4 +-
.../Services/Voice/VoiceService.cs | 1 +
.../Windows/WebChatWindow.xaml.cs | 59 +++++++++++++++++--
.../VoiceModeSchemaTests.cs | 1 +
.../SettingsRoundTripTests.cs | 4 ++
9 files changed, 90 insertions(+), 10 deletions(-)
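
Reviewer note: the display filter in this patch comes down to one regular expression applied to rendered text. The sketch below isolates that step in JavaScript, the language of the tray script. `stripInjectedMemories` and `sample` are hypothetical names used only for illustration and are not part of the patch. The pattern is copied from `TrayVoiceIntegrationScript`. This sketch trims leading whitespace only when a block was actually removed, so text without injected memories passes through unchanged.

```javascript
// Pattern copied from the tray script: non-greedy, case-insensitive, spans
// newlines, and also swallows whitespace that trails the closing tag.
const memoryPattern = /<relevant-memories>[\s\S]*?<\/relevant-memories>\s*/gi;

// Hypothetical helper (not in the patch): computes the display text for the
// content of a single text node.
function stripInjectedMemories(text, enabled = true) {
  if (!enabled) return text;
  const cleaned = text.replace(memoryPattern, '');
  // Only touch leading whitespace when something was actually stripped.
  return cleaned === text ? text : cleaned.trimStart();
}

const sample =
  '<relevant-memories>\n- likes tea\n</relevant-memories>\n\nWhat is the weather?';

console.log(stripInjectedMemories(sample));        // What is the weather?
console.log(stripInjectedMemories(sample, false)); // unchanged: filter disabled
```

Because the filter only rewrites text in the rendered DOM, the message sent upstream is not affected; turning the setting off stops further stripping but does not restore text that was already removed from the page.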
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 118a150..391293d 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -57,6 +57,7 @@ The contracts and persisted settings now use `VoiceWake` and `TalkMode` as well.
- the node performs local TTS playback of that reply
- assistant replies are queued locally and spoken sequentially, with a short 500 ms pause between queued replies so overlapping responses are not lost
- if a reply arrives after the normal 45-second wait timeout, the tray still accepts and speaks that late reply for a short bounded grace window so slow upstream responses are not silently lost
+- the tray chat window can optionally strip injected `<relevant-memories>...</relevant-memories>` blocks from the rendered display without changing the underlying upstream message
To avoid obvious duplicate sends from the Windows recognizer, exact duplicate final transcripts are suppressed within a short 750 ms window.
@@ -502,6 +503,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
|---|---|---|---|---|
| `Voice.Mode` | enum | `Off` | all | Activation mode: `Off`, `VoiceWake`, `TalkMode` |
| `Voice.Enabled` | bool | `false` | all | Master enable/disable flag for voice mode |
+| `Voice.StripInjectedMemoriesInChat` | bool | `true` | all | If `true`, the tray chat window strips injected `<relevant-memories>` scaffolding from rendered chat text |
| `Voice.SpeechToTextProviderId` | string | `windows` | all | Preferred speech-to-text provider id |
| `Voice.TextToSpeechProviderId` | string | `windows` | all | Preferred text-to-speech provider id |
| `Voice.InputDeviceId` | string? | `null` | all | Preferred microphone device id; `null` means system default |
@@ -750,3 +752,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Delayed the Talk Mode ready state until recognizer warm-up completes, so the UI does not advertise listening before the first recognition session has settled.
- `2026-03-25` Added a recognizer health-check watchdog so Talk Mode recycles a started-but-deaf recognition session instead of waiting minutes for Windows to cancel it.
- `2026-03-25` Reverted Talk Mode to a single direct `chat.send` path and reduced the tray chat integration back to draft mirroring only.
+- `2026-03-25` Added a configurable tray-chat display filter for injected `<relevant-memories>` blocks.
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index 3fe9d1b..6c52886 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -62,6 +62,7 @@ public sealed class VoiceSettings
public VoiceActivationMode Mode { get; set; } = VoiceActivationMode.Off;
public bool Enabled { get; set; }
public bool ShowConversationToasts { get; set; }
+ public bool StripInjectedMemoriesInChat { get; set; } = true;
public string SpeechToTextProviderId { get; set; } = VoiceProviderIds.Windows;
public string TextToSpeechProviderId { get; set; } = VoiceProviderIds.Windows;
public string? InputDeviceId { get; set; }
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index 9e90f64..818847e 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -1786,10 +1786,10 @@ private async void OnSettingsSaved(object? sender, EventArgs e)
}
}
- if (_settings?.EnableNodeMode == true)
- {
- InitializeNodeService();
- }
+ if (_settings?.EnableNodeMode == true)
+ {
+ InitializeNodeService();
+ }
else
{
InitializeGatewayClient();
@@ -1819,6 +1819,18 @@ private async void OnSettingsSaved(object? sender, EventArgs e)
_globalHotkey?.Unregister();
}
+ if (_webChatWindow != null && !_webChatWindow.IsClosed)
+ {
+ try
+ {
+ await _webChatWindow.SetStripInjectedMemoriesEnabledAsync(_settings.Voice.StripInjectedMemoriesInChat);
+ }
+ catch (Exception ex)
+ {
+ Logger.Warn($"Failed to refresh tray chat cleanup setting: {ex.Message}");
+ }
+ }
+
// Update auto-start
AutoStartManager.SetAutoStart(_settings.AutoStart);
}
@@ -1827,7 +1839,10 @@ private void ShowWebChat()
{
if (_webChatWindow == null || _webChatWindow.IsClosed)
{
- _webChatWindow = new WebChatWindow(_settings!.GatewayUrl, _settings.Token);
+ _webChatWindow = new WebChatWindow(
+ _settings!.GatewayUrl,
+ _settings.Token,
+ _settings.Voice.StripInjectedMemoriesInChat);
_webChatWindow.Closed += (s, e) =>
{
_voiceChatCoordinator?.DetachWindow(_webChatWindow);
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
index 5277f37..7353299 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
@@ -57,6 +57,8 @@
<CheckBox x:Name="VoiceConversationToastsCheckBox"
Content="Show voice transcripts and replies as toasts"/>
+ <CheckBox x:Name="VoiceStripInjectedMemoriesCheckBox"
+ Content="Hide injected &lt;relevant-memories&gt; blocks in the tray chat window"/>
<TextBlock x:Name="VoiceSettingsInfoTextBlock"
Style="{StaticResource CaptionTextBlockStyle}"
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
index 103b31d..ce985be 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -46,6 +46,7 @@ public async Task ApplyAsync(SettingsManager settings)
Mode = GetSelectedVoiceMode(),
Enabled = GetSelectedVoiceMode() != VoiceActivationMode.Off,
ShowConversationToasts = VoiceConversationToastsCheckBox.IsChecked ?? false,
+ StripInjectedMemoriesInChat = VoiceStripInjectedMemoriesCheckBox.IsChecked ?? true,
SpeechToTextProviderId = (VoiceSpeechToTextProviderComboBox.SelectedItem as VoiceProviderOption)?.Id ?? VoiceProviderIds.Windows,
TextToSpeechProviderId = (VoiceTextToSpeechProviderComboBox.SelectedItem as VoiceProviderOption)?.Id ?? VoiceProviderIds.Windows,
InputDeviceId = (VoiceInputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
@@ -94,6 +95,7 @@ private void LoadVoiceSettings()
LoadVoiceProviders();
SelectVoiceMode(_settings.Voice.Mode);
VoiceConversationToastsCheckBox.IsChecked = _settings.Voice.ShowConversationToasts;
+ VoiceStripInjectedMemoriesCheckBox.IsChecked = _settings.Voice.StripInjectedMemoriesInChat;
UpdateVoiceProviderSettingsEditor();
UpdateVoiceSettingsInfo();
}
@@ -209,7 +211,7 @@ private void UpdateVoiceSettingsInfo()
}
VoiceSettingsInfoTextBlock.Text =
- $"Mode: {VoiceDisplayHelper.GetModeLabel(GetSelectedVoiceMode())}. STT: {stt}. TTS: {tts}. Listen: {input}. Talk: {output}.{fallbackNotice}";
+ $"Mode: {VoiceDisplayHelper.GetModeLabel(GetSelectedVoiceMode())}. STT: {stt}. TTS: {tts}. Listen: {input}. Talk: {output}. Chat cleanup: {(VoiceStripInjectedMemoriesCheckBox.IsChecked ?? true ? "on" : "off")}.{fallbackNotice}";
}
private void UpdateVoiceProviderSettingsEditor()
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index e063ce2..9c216b8 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -1788,6 +1788,7 @@ private static VoiceSettings Clone(VoiceSettings source)
Mode = source.Mode,
Enabled = source.Enabled,
ShowConversationToasts = source.ShowConversationToasts,
+ StripInjectedMemoriesInChat = source.StripInjectedMemoriesInChat,
SpeechToTextProviderId = source.SpeechToTextProviderId,
TextToSpeechProviderId = source.TextToSpeechProviderId,
InputDeviceId = source.InputDeviceId,
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index c660697..1e5bd9c 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -20,6 +20,7 @@ public sealed partial class WebChatWindow : WindowEx
{
private readonly string _gatewayUrl;
private readonly string _token;
+ private bool _stripInjectedMemories;
private string _pendingVoiceDraft = string.Empty;
// Store event handlers for cleanup
@@ -28,10 +29,12 @@ public sealed partial class WebChatWindow : WindowEx
public bool IsClosed { get; private set; }
- private const string TrayVoiceIntegrationScript = """
+ private const string TrayVoiceIntegrationScript = """
(() => {
const isVisible = (el) => !!el && !(el.disabled === true) && el.getClientRects().length > 0;
+ const memoryPattern = /<relevant-memories>[\s\S]*?<\/relevant-memories>\s*/gi;
let desiredDraft = '';
+ let stripInjectedMemories = true;
const findComposer = () => {
const candidates = Array.from(document.querySelectorAll('textarea, input[type="text"], [contenteditable="true"], [contenteditable="plaintext-only"]'));
return candidates.find(isVisible) || null;
@@ -61,11 +64,43 @@ public sealed partial class WebChatWindow : WindowEx
setElementValue(composer, desiredDraft);
return true;
};
- const observer = new MutationObserver(() => applyDraftIfPossible());
+ const cleanTextNodes = () => {
+ if (!stripInjectedMemories || !document.body) return false;
+ const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
+ const nodes = [];
+ let current;
+ while ((current = walker.nextNode())) {
+ nodes.push(current);
+ }
+ let changed = false;
+ for (const node of nodes) {
+ if (!node || !node.parentElement) continue;
+ const tag = node.parentElement.tagName;
+ if (tag === 'SCRIPT' || tag === 'STYLE' || tag === 'TEXTAREA') continue;
+ const original = node.textContent || '';
+ const cleaned = original.replace(memoryPattern, '');
+ if (cleaned !== original) {
+ node.textContent = cleaned.trimStart();
+ changed = true;
+ }
+ }
+ return changed;
+ };
+ let refreshScheduled = false;
+ const refreshView = () => {
+ if (refreshScheduled) return;
+ refreshScheduled = true;
+ queueMicrotask(() => {
+ refreshScheduled = false;
+ cleanTextNodes();
+ applyDraftIfPossible();
+ });
+ };
+ const observer = new MutationObserver(() => refreshView());
const start = () => {
if (!document.body) return;
observer.observe(document.body, { childList: true, subtree: true });
- applyDraftIfPossible();
+ refreshView();
};
if (document.readyState === 'loading') {
document.addEventListener('DOMContentLoaded', start, { once: true });
@@ -77,6 +112,11 @@ public sealed partial class WebChatWindow : WindowEx
desiredDraft = text || '';
return applyDraftIfPossible();
},
+ setStripInjectedMemories(enabled) {
+ stripInjectedMemories = !!enabled;
+ refreshView();
+ return true;
+ },
clearDraft() {
desiredDraft = '';
return true;
@@ -85,11 +125,12 @@ public sealed partial class WebChatWindow : WindowEx
})();
""";
- public WebChatWindow(string gatewayUrl, string token)
+ public WebChatWindow(string gatewayUrl, string token, bool stripInjectedMemories)
{
Logger.Info($"WebChatWindow: Constructor called, gateway={gatewayUrl}");
_gatewayUrl = gatewayUrl;
_token = token;
+ _stripInjectedMemories = stripInjectedMemories;
InitializeComponent();
@@ -336,6 +377,12 @@ public async Task UpdateVoiceTranscriptDraftAsync(string text, bool clear)
await RefreshTrayVoiceDomStateAsync();
}
+ public async Task SetStripInjectedMemoriesEnabledAsync(bool enabled)
+ {
+ _stripInjectedMemories = enabled;
+ await RefreshTrayVoiceDomStateAsync();
+ }
+
private async Task RefreshTrayVoiceDomStateAsync()
{
if (WebView.CoreWebView2 == null)
@@ -345,6 +392,10 @@ private async Task RefreshTrayVoiceDomStateAsync()
try
{
+ var stripJson = _stripInjectedMemories ? "true" : "false";
+ await WebView.CoreWebView2.ExecuteScriptAsync(
+ $"window.__openClawTrayVoice?.setStripInjectedMemories?.({stripJson});");
+
var draftJson = JsonSerializer.Serialize(_pendingVoiceDraft ?? string.Empty);
var script = string.IsNullOrWhiteSpace(_pendingVoiceDraft)
? "window.__openClawTrayVoice?.clearDraft?.();"
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index 0c58a64..70047cf 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -34,6 +34,7 @@ public void VoiceSettings_Defaults_AreConcreteAndProviderAgnostic()
Assert.False(settings.Enabled);
Assert.Equal(VoiceActivationMode.Off, settings.Mode);
Assert.False(settings.ShowConversationToasts);
+ Assert.True(settings.StripInjectedMemoriesInChat);
Assert.Equal(VoiceProviderIds.Windows, settings.SpeechToTextProviderId);
Assert.Equal(VoiceProviderIds.Windows, settings.TextToSpeechProviderId);
Assert.Equal(16000, settings.SampleRateHz);
diff --git a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
index d52efc0..2b4f5a1 100644
--- a/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
+++ b/tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs
@@ -34,6 +34,7 @@ public void RoundTrip_AllFields_Preserved()
Enabled = true,
Mode = VoiceActivationMode.VoiceWake,
ShowConversationToasts = true,
+ StripInjectedMemoriesInChat = false,
SpeechToTextProviderId = "windows",
TextToSpeechProviderId = "elevenlabs",
InputDeviceId = "mic-1",
@@ -116,6 +117,7 @@ public void RoundTrip_AllFields_Preserved()
Assert.True(restored.Voice.Enabled);
Assert.Equal(VoiceActivationMode.VoiceWake, restored.Voice.Mode);
Assert.True(restored.Voice.ShowConversationToasts);
+ Assert.False(restored.Voice.StripInjectedMemoriesInChat);
Assert.Equal("windows", restored.Voice.SpeechToTextProviderId);
Assert.Equal("elevenlabs", restored.Voice.TextToSpeechProviderId);
Assert.Equal("mic-1", restored.Voice.InputDeviceId);
@@ -181,6 +183,7 @@ public void MissingFields_UseDefaults()
Assert.False(settings.Voice.Enabled);
Assert.Equal(VoiceActivationMode.Off, settings.Voice.Mode);
Assert.False(settings.Voice.ShowConversationToasts);
+ Assert.True(settings.Voice.StripInjectedMemoriesInChat);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.SpeechToTextProviderId);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.TextToSpeechProviderId);
Assert.NotNull(settings.VoiceProviderConfiguration);
@@ -249,6 +252,7 @@ public void BackwardCompatibility_OldSettingsWithoutNewFields()
Assert.False(settings.Voice.Enabled);
Assert.Equal(VoiceActivationMode.Off, settings.Voice.Mode);
Assert.False(settings.Voice.ShowConversationToasts);
+ Assert.True(settings.Voice.StripInjectedMemoriesInChat);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.SpeechToTextProviderId);
Assert.Equal(VoiceProviderIds.Windows, settings.Voice.TextToSpeechProviderId);
Assert.Null(settings.UserRules);
From c212c3f0343166f698f1a802445c72862b26f96d Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 16:19:10 +0000
Subject: [PATCH 39/83] Fix stalled talk mode recognizer recycle
---
docs/VOICE-MODE.md | 1 +
.../Services/Voice/VoiceService.cs | 48 ++++++++++++++++---
.../VoiceServiceTransportTests.cs | 26 ++++++++++
3 files changed, 68 insertions(+), 7 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 391293d..7cab69e 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -753,3 +753,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Added a recognizer health-check watchdog so Talk Mode recycles a started-but-deaf recognition session instead of waiting minutes for Windows to cancel it.
- `2026-03-25` Reverted Talk Mode to a single direct `chat.send` path and reduced the tray chat integration back to draft mirroring only.
- `2026-03-25` Added a configurable tray-chat display filter for injected `<relevant-memories>` blocks.
+- `2026-03-25` Fixed the recognizer watchdog so a stalled Talk Mode session is actually canceled and restarted instead of logging a recycle and then remaining deaf.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 9c216b8..ff50023 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -52,6 +52,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private bool _recognitionActive;
private int _recognitionSessionGeneration;
private bool _recognitionHealthCheckArmed;
+ private bool _recognitionRestartInProgress;
private bool _awaitingReply;
private bool _isSpeaking;
private bool _replyPlaybackLoopActive;
@@ -1355,6 +1356,20 @@ internal static bool ShouldAcceptLateAssistantReply(
utcNow <= lateReplyGraceUntilUtc.Value;
}
+ internal static bool ShouldRestartRecognitionAfterCompletion(
+ bool running,
+ VoiceActivationMode mode,
+ bool restartInProgress,
+ bool awaitingReply,
+ bool isSpeaking)
+ {
+ return running &&
+ mode == VoiceActivationMode.TalkMode &&
+ !restartInProgress &&
+ !awaitingReply &&
+ !isSpeaking;
+ }
+
private static string CreateReplyPreview(string text)
{
var trimmed = text.Trim();
@@ -1412,6 +1427,7 @@ private async void OnSpeechRecognitionCompleted(
{
CancellationToken token;
var shouldRestart = false;
+ var restartInProgress = false;
lock (_gate)
{
@@ -1421,14 +1437,25 @@ private async void OnSpeechRecognitionCompleted(
}
_recognitionActive = false;
- _recognitionHealthCheckArmed =
- args.Status == SpeechRecognitionResultStatus.UserCanceled ||
- args.Status == SpeechRecognitionResultStatus.TimeoutExceeded;
+ restartInProgress = _recognitionRestartInProgress;
+ if (restartInProgress)
+ {
+ _recognitionRestartInProgress = false;
+ _recognitionHealthCheckArmed = false;
+ }
+ else
+ {
+ _recognitionHealthCheckArmed =
+ args.Status == SpeechRecognitionResultStatus.UserCanceled ||
+ args.Status == SpeechRecognitionResultStatus.TimeoutExceeded;
+ }
token = _runtimeCts.Token;
- shouldRestart = _status.Running &&
- _status.Mode == VoiceActivationMode.TalkMode &&
- !_awaitingReply &&
- !_isSpeaking;
+ shouldRestart = ShouldRestartRecognitionAfterCompletion(
+ _status.Running,
+ _status.Mode,
+ restartInProgress,
+ _awaitingReply,
+ _isSpeaking);
}
_logger.Warn($"Speech recognition session completed with status {args.Status}; restart={shouldRestart}");
@@ -1521,6 +1548,7 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
recognizer = _speechRecognizer;
_speechRecognizer = null;
_recognitionActive = false;
+ _recognitionRestartInProgress = false;
synthesizer = _speechSynthesizer;
_speechSynthesizer = null;
@@ -1702,6 +1730,7 @@ private async Task MonitorRecognitionSessionHealthAsync(int generation, Cancella
if (shouldRecycle)
{
+ _recognitionRestartInProgress = true;
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
@@ -1717,6 +1746,7 @@ private async Task MonitorRecognitionSessionHealthAsync(int generation, Cancella
_logger.Warn(
$"Speech recognition session produced no hypotheses/results within {RecognitionHealthCheckDelay.TotalSeconds:0}s; recycling session");
+ await StopRecognitionSessionAsync();
await ResumeRecognitionSessionAsync(cancellationToken, "recognition health check");
}
catch (OperationCanceledException)
@@ -1724,6 +1754,10 @@ private async Task MonitorRecognitionSessionHealthAsync(int generation, Cancella
}
catch (Exception ex)
{
+ lock (_gate)
+ {
+ _recognitionRestartInProgress = false;
+ }
_logger.Warn($"Speech recognition health check failed: {ex.Message}");
}
}
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index ce41a37..b21940e 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -123,6 +123,32 @@ public void ShouldAcceptLateAssistantReply_OnlyMatchesBoundedGraceWindow(
Assert.Equal(expected, result);
}
+ [Theory]
+ [InlineData(true, false, false)]
+ [InlineData(false, true, false)]
+ [InlineData(false, false, true)]
+ public void ShouldRestartRecognitionAfterCompletion_SuppressesControlledRecycle(
+ bool restartInProgress,
+ bool awaitingReply,
+ bool expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "ShouldRestartRecognitionAfterCompletion",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (bool)method.Invoke(
+ null,
+ [
+ true,
+ VoiceActivationMode.TalkMode,
+ restartInProgress,
+ awaitingReply,
+ false
+ ])!;
+
+ Assert.Equal(expected, result);
+ }
+
private static MethodInfo GetMethod()
{
return typeof(VoiceService).GetMethod(
From e07234b97ef03b76d6a2595900e42ffd37a778a6 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 16:33:42 +0000
Subject: [PATCH 40/83] Rebuild talk mode recognizer after deaf sessions
---
docs/VOICE-MODE.md | 1 +
.../Services/Voice/VoiceService.cs | 116 +++++++++++++++++-
.../VoiceServiceTransportTests.cs | 28 +++++
3 files changed, 141 insertions(+), 4 deletions(-)
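
Reviewer note: after this patch, two static predicates in `VoiceService` decide what happens when a recognition session completes: whether to restart at all, and whether to rebuild the recognizer first. The sketch below transliterates both into JavaScript purely for illustration. The real implementations are the C# methods `ShouldRestartRecognitionAfterCompletion` and `ShouldRebuildRecognitionAfterCompletion` in the diff; the object-style parameters, the `talkMode` flag (standing in for `mode == VoiceActivationMode.TalkMode`), and the string status values are assumptions of the sketch.

```javascript
// Illustrative port; status strings stand in for SpeechRecognitionResultStatus.
function shouldRestart({ running, talkMode, restartInProgress, awaitingReply, isSpeaking }) {
  return running && talkMode && !restartInProgress && !awaitingReply && !isSpeaking;
}

function shouldRebuild({ status, sessionHadActivity, restartInProgress, awaitingReply, isSpeaking }) {
  // A session that heard anything, or one being recycled on purpose, keeps its recognizer.
  if (restartInProgress || awaitingReply || isSpeaking || sessionHadActivity) return false;
  return status === 'UserCanceled' || status === 'TimeoutExceeded';
}

// Shared baseline: Talk Mode running, nothing else in flight.
const idle = {
  running: true, talkMode: true,
  restartInProgress: false, awaitingReply: false, isSpeaking: false,
};

const cases = [
  { status: 'UserCanceled', sessionHadActivity: false }, // "deaf" session
  { status: 'UserCanceled', sessionHadActivity: true },  // heard speech, then ended
  { status: 'Success', sessionHadActivity: false },
];

for (const c of cases) {
  const flags = { ...idle, ...c };
  console.log(c.status, c.sessionHadActivity,
    'restart=' + shouldRestart(flags), 'rebuild=' + shouldRebuild(flags));
}
```

Under these assumptions only the first case asks for a rebuild: the session ended with `UserCanceled` without ever producing a hypothesis or result. A watchdog-driven recycle (`restartInProgress` set) suppresses both predicates in the completion handler, since the health check performs its own stop and resume.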
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 7cab69e..cf26532 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -754,3 +754,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Reverted Talk Mode to a single direct `chat.send` path and reduced the tray chat integration back to draft mirroring only.
- `2026-03-25` Added a configurable tray-chat display filter for injected `<relevant-memories>` blocks.
- `2026-03-25` Fixed the recognizer watchdog so a stalled Talk Mode session is actually canceled and restarted instead of logging a recycle and then remaining deaf.
+- `2026-03-25` Rebuilt the Windows speech recognizer after repeated deaf `UserCanceled` and watchdog-recycle failures instead of repeatedly restarting the same broken recognizer instance.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index ff50023..730126b 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -51,6 +51,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private MediaPlayer? _mediaPlayer;
private bool _recognitionActive;
private int _recognitionSessionGeneration;
+ private bool _recognitionSessionHadActivity;
private bool _recognitionHealthCheckArmed;
private bool _recognitionRestartInProgress;
private bool _awaitingReply;
@@ -718,6 +719,7 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
lock (_gate)
{
_recognitionActive = true;
+ _recognitionSessionHadActivity = false;
if (updateListeningStatus && _status.Running && !_awaitingReply && !_isSpeaking)
{
_status = BuildRunningStatus(
@@ -735,7 +737,8 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
private async Task ResumeRecognitionSessionAsync(
CancellationToken cancellationToken,
string reason,
- string? lastError = null)
+ string? lastError = null,
+ bool rebuildRecognizer = false)
{
const int maxAttempts = 2;
string? currentError = lastError;
@@ -746,6 +749,11 @@ private async Task ResumeRecognitionSessionAsync(
try
{
+ if (rebuildRecognizer && attempt == 1)
+ {
+ await RebuildSpeechRecognizerAsync(reason, cancellationToken);
+ }
+
await StartRecognitionSessionAsync();
return;
}
@@ -800,6 +808,7 @@ private async Task StopRecognitionSessionAsync()
}
_recognitionActive = false;
+ _recognitionHealthCheckArmed = false;
}
try
@@ -894,6 +903,7 @@ private void OnSpeechHypothesisGenerated(SpeechRecognizer sender, SpeechRecognit
text = args.Hypothesis?.Text?.Trim();
sessionKey = GetCurrentVoiceSessionKey();
+ _recognitionSessionHadActivity = true;
_recognitionHealthCheckArmed = false;
if (_status.State != VoiceRuntimeState.RecordingUtterance)
{
@@ -943,6 +953,7 @@ private async Task HandleRecognizedTextAsync(string text)
_lastTranscript = text;
_lastTranscriptUtc = DateTime.UtcNow;
+ _recognitionSessionHadActivity = true;
_recognitionHealthCheckArmed = false;
cancellationToken = _runtimeCts.Token;
sessionKey = GetCurrentVoiceSessionKey();
@@ -1370,6 +1381,22 @@ internal static bool ShouldRestartRecognitionAfterCompletion(
!isSpeaking;
}
+ internal static bool ShouldRebuildRecognitionAfterCompletion(
+ SpeechRecognitionResultStatus status,
+ bool sessionHadActivity,
+ bool restartInProgress,
+ bool awaitingReply,
+ bool isSpeaking)
+ {
+ if (restartInProgress || awaitingReply || isSpeaking || sessionHadActivity)
+ {
+ return false;
+ }
+
+ return status == SpeechRecognitionResultStatus.UserCanceled ||
+ status == SpeechRecognitionResultStatus.TimeoutExceeded;
+ }
+
private static string CreateReplyPreview(string text)
{
var trimmed = text.Trim();
@@ -1427,7 +1454,9 @@ private async void OnSpeechRecognitionCompleted(
{
CancellationToken token;
var shouldRestart = false;
+ var shouldRebuildRecognizer = false;
var restartInProgress = false;
+ var sessionHadActivity = false;
lock (_gate)
{
@@ -1437,6 +1466,8 @@ private async void OnSpeechRecognitionCompleted(
}
_recognitionActive = false;
+ sessionHadActivity = _recognitionSessionHadActivity;
+ _recognitionSessionHadActivity = false;
restartInProgress = _recognitionRestartInProgress;
if (restartInProgress)
{
@@ -1456,14 +1487,24 @@ private async void OnSpeechRecognitionCompleted(
restartInProgress,
_awaitingReply,
_isSpeaking);
+ shouldRebuildRecognizer = ShouldRebuildRecognitionAfterCompletion(
+ args.Status,
+ sessionHadActivity,
+ restartInProgress,
+ _awaitingReply,
+ _isSpeaking);
}
- _logger.Warn($"Speech recognition session completed with status {args.Status}; restart={shouldRestart}");
+ _logger.Warn(
+ $"Speech recognition session completed with status {args.Status}; restart={shouldRestart}; rebuild={shouldRebuildRecognizer}; hadActivity={sessionHadActivity}");
if (shouldRestart && !token.IsCancellationRequested)
{
await Task.Delay(250, token);
- await ResumeRecognitionSessionAsync(token, $"recognition completed ({args.Status})");
+ await ResumeRecognitionSessionAsync(
+ token,
+ $"recognition completed ({args.Status})",
+ rebuildRecognizer: shouldRebuildRecognizer);
}
}
catch (OperationCanceledException)
@@ -1548,6 +1589,7 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
recognizer = _speechRecognizer;
_speechRecognizer = null;
_recognitionActive = false;
+ _recognitionSessionHadActivity = false;
_recognitionRestartInProgress = false;
synthesizer = _speechSynthesizer;
@@ -1708,6 +1750,69 @@ private void ArmRecognitionHealthCheck()
}
}
+ private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken cancellationToken)
+ {
+ SpeechRecognizer? oldRecognizer;
+ SpeechRecognizer? newRecognizer = null;
+ VoiceSettings settings;
+
+ lock (_gate)
+ {
+ if (_runtimeCts == null || _runtimeCts.IsCancellationRequested)
+ {
+ return;
+ }
+
+ oldRecognizer = _speechRecognizer;
+ settings = Clone(_settings.Voice);
+ _speechRecognizer = null;
+ _recognitionActive = false;
+ _recognitionSessionHadActivity = false;
+ _recognitionHealthCheckArmed = false;
+ }
+
+ if (oldRecognizer != null)
+ {
+ try { oldRecognizer.HypothesisGenerated -= OnSpeechHypothesisGenerated; } catch { }
+ try { oldRecognizer.ContinuousRecognitionSession.ResultGenerated -= OnSpeechResultGenerated; } catch { }
+ try { oldRecognizer.ContinuousRecognitionSession.Completed -= OnSpeechRecognitionCompleted; } catch { }
+ try { await oldRecognizer.ContinuousRecognitionSession.CancelAsync(); } catch { }
+ try { oldRecognizer.Dispose(); } catch { }
+ }
+
+ try
+ {
+ cancellationToken.ThrowIfCancellationRequested();
+ newRecognizer = await CreateSpeechRecognizerAsync(settings);
+ newRecognizer.HypothesisGenerated += OnSpeechHypothesisGenerated;
+ newRecognizer.ContinuousRecognitionSession.ResultGenerated += OnSpeechResultGenerated;
+ newRecognizer.ContinuousRecognitionSession.Completed += OnSpeechRecognitionCompleted;
+
+ lock (_gate)
+ {
+ if (_runtimeCts == null || _runtimeCts.IsCancellationRequested || !_status.Running)
+ {
+ return;
+ }
+
+ _speechRecognizer = newRecognizer;
+ newRecognizer = null;
+ }
+
+ _logger.Warn($"Speech recognizer rebuilt ({reason})");
+ }
+ finally
+ {
+ if (newRecognizer != null)
+ {
+ try { newRecognizer.HypothesisGenerated -= OnSpeechHypothesisGenerated; } catch { }
+ try { newRecognizer.ContinuousRecognitionSession.ResultGenerated -= OnSpeechResultGenerated; } catch { }
+ try { newRecognizer.ContinuousRecognitionSession.Completed -= OnSpeechRecognitionCompleted; } catch { }
+ try { newRecognizer.Dispose(); } catch { }
+ }
+ }
+ }
+
private async Task MonitorRecognitionSessionHealthAsync(int generation, CancellationToken cancellationToken)
{
try
@@ -1747,7 +1852,10 @@ private async Task MonitorRecognitionSessionHealthAsync(int generation, Cancella
_logger.Warn(
$"Speech recognition session produced no hypotheses/results within {RecognitionHealthCheckDelay.TotalSeconds:0}s; recycling session");
await StopRecognitionSessionAsync();
- await ResumeRecognitionSessionAsync(cancellationToken, "recognition health check");
+ await ResumeRecognitionSessionAsync(
+ cancellationToken,
+ "recognition health check",
+ rebuildRecognizer: true);
}
catch (OperationCanceledException)
{
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index b21940e..4d8bf6d 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -1,6 +1,7 @@
using System.Reflection;
using OpenClaw.Shared;
using OpenClawTray.Services.Voice;
+using Windows.Media.SpeechRecognition;
namespace OpenClaw.Tray.Tests;
@@ -149,6 +150,33 @@ public void ShouldRestartRecognitionAfterCompletion_SuppressesControlledRecycle(
Assert.Equal(expected, result);
}
+ [Theory]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, false, true)]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, true)]
+ [InlineData(SpeechRecognitionResultStatus.Success, false, false, false, false, false)]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, true, false, false, false, false)]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, true, false, false, false)]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, true, false, false)]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, true, false)]
+ public void ShouldRebuildRecognitionAfterCompletion_OnlyRebuildsForDeafCanceledSessions(
+ SpeechRecognitionResultStatus status,
+ bool sessionHadActivity,
+ bool restartInProgress,
+ bool awaitingReply,
+ bool isSpeaking,
+ bool expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "ShouldRebuildRecognitionAfterCompletion",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (bool)method.Invoke(
+ null,
+ [status, sessionHadActivity, restartInProgress, awaitingReply, isSpeaking])!;
+
+ Assert.Equal(expected, result);
+ }
+
private static MethodInfo GetMethod()
{
return typeof(VoiceService).GetMethod(
From 3c6224415cd237c2ac83149594193bb602f01064 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 16:42:12 +0000
Subject: [PATCH 41/83] Fix voice draft clearing and playback start
---
docs/VOICE-MODE.md | 1 +
.../Services/Voice/VoiceService.cs | 14 +++++++++++++-
.../Windows/WebChatWindow.xaml.cs | 2 +-
3 files changed, 15 insertions(+), 2 deletions(-)
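
Note on the playback change: the hunk below waits for `MediaOpened` before calling `Play()`, and routes failure and cancellation into both completion sources so that neither await can hang. The following is a minimal sketch of that gating pattern only. It does not use the real `MediaPlayer`; the delayed task and the `PlayAfterOpenAsync` name are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a media player: the "opened" notification arrives
// some time after the source is assigned, as MediaOpened does in the patch.
static async Task<string[]> PlayAfterOpenAsync(TimeSpan openDelay, CancellationToken token)
{
    var log = new List<string>();
    var mediaOpened = new TaskCompletionSource<bool>(
        TaskCreationOptions.RunContinuationsAsynchronously);

    // Cancellation completes the gate too, so the await below cannot wait forever.
    using var registration = token.Register(() => mediaOpened.TrySetCanceled(token));

    // Simulates assigning player.Source; the open event fires later on another thread.
    log.Add("source-set");
    _ = Task.Delay(openDelay).ContinueWith(_ =>
    {
        log.Add("opened");
        mediaOpened.TrySetResult(true);
    });

    await mediaOpened.Task; // gate: playback starts only once the media is open
    log.Add("play");
    return log.ToArray();
}

var order = await PlayAfterOpenAsync(TimeSpan.FromMilliseconds(50), CancellationToken.None);
Console.WriteLine(string.Join(" -> ", order)); // source-set -> opened -> play
```

In the patch itself the same completion source is also faulted from the `MediaFailed` handler, which is what keeps a broken stream from stalling the reply queue.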
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index cf26532..a216dba 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -755,3 +755,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Added a configurable tray-chat display filter for injected `<relevant-memories>` blocks.
- `2026-03-25` Fixed the recognizer watchdog so a stalled Talk Mode session is actually canceled and restarted instead of logging a recycle and then remaining deaf.
- `2026-03-25` Rebuilt the Windows speech recognizer after repeated deaf `UserCanceled` and watchdog-recycle failures instead of repeatedly restarting the same broken recognizer instance.
+- `2026-03-25` Fixed the tray-chat draft mirror so it clears immediately after direct send, and primed media playback before `Play()` so spoken replies stop clipping their opening syllables.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 730126b..767560b 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -1415,31 +1415,43 @@ private static async Task PlayStreamAsync(
CancellationToken cancellationToken)
{
stream.Seek(0);
+ var mediaOpened = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
var playbackEnded = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+ TypedEventHandler<MediaPlayer, object>? openedHandler = null;
TypedEventHandler<MediaPlayer, object>? endedHandler = null;
TypedEventHandler<MediaPlayer, MediaPlayerFailedEventArgs>? failedHandler = null;
+ openedHandler = (sender, _) => mediaOpened.TrySetResult(true);
endedHandler = (sender, _) => playbackEnded.TrySetResult(true);
- failedHandler = (sender, args) => playbackEnded.TrySetException(new InvalidOperationException(args.ErrorMessage));
+ failedHandler = (sender, args) =>
+ {
+ var exception = new InvalidOperationException(args.ErrorMessage);
+ mediaOpened.TrySetException(exception);
+ playbackEnded.TrySetException(exception);
+ };
+ player.MediaOpened += openedHandler;
player.MediaEnded += endedHandler;
player.MediaFailed += failedHandler;
using var registration = cancellationToken.Register(() =>
{
try { player.Pause(); } catch { }
try { player.Source = null; } catch { }
+ mediaOpened.TrySetCanceled(cancellationToken);
playbackEnded.TrySetCanceled(cancellationToken);
});
try
{
player.Source = MediaSource.CreateFromStream(stream, contentType);
+ await mediaOpened.Task;
player.Play();
await playbackEnded.Task;
}
finally
{
+ player.MediaOpened -= openedHandler;
player.MediaEnded -= endedHandler;
player.MediaFailed -= failedHandler;
player.Source = null;
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index 1e5bd9c..661013c 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -119,7 +119,7 @@ public sealed partial class WebChatWindow : WindowEx
},
clearDraft() {
desiredDraft = '';
- return true;
+ return applyDraftIfPossible();
}
};
})();
From 40383a1f75d5962b0d079c6490b4ac9690875276 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 16:44:20 +0000
Subject: [PATCH 42/83] Add streaming playback backlog story
---
docs/VOICE-MODE.md | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index a216dba..d21eea1 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -712,6 +712,19 @@ Notes:
- the runtime control surface should drive this window rather than the window manipulating `VoiceService` internals directly
- if implemented later, the strip should use the shared runtime control API described elsewhere in this document.
+### Story: True streaming TTS playback
+
+Start speaking assistant replies from the first usable audio chunk instead of waiting for a complete playable stream.
+
+Notes:
+
+- the current implementation uses WebSocket transport for MiniMax, but still buffers the entire audio response before playback begins
+- `firstChunk=...ms` in the log is currently provider-chunk arrival time, not actual speech-start time
+- implement a playback path that can consume incremental audio data as it arrives from the provider
+- the provider catalog contract should remain transport-driven and provider-agnostic, so streaming behavior should be expressed through the existing TTS contract model rather than hard-coded for MiniMax
+- preserve the existing queued reply behavior, skip support, and late-reply handling while switching playback to progressive output
+- add timing logs that separate `firstChunk`, `playbackStart`, and `playbackEnd` so latency improvements are measurable
+
## Commit Timeline
Append one new line to this timeline for every future voice-mode commit.
@@ -756,3 +769,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Fixed the recognizer watchdog so a stalled Talk Mode session is actually canceled and restarted instead of logging a recycle and then remaining deaf.
- `2026-03-25` Rebuilt the Windows speech recognizer after repeated deaf `UserCanceled` and watchdog-recycle failures instead of repeatedly restarting the same broken recognizer instance.
- `2026-03-25` Fixed the tray-chat draft mirror so it clears immediately after direct send, and primed media playback before `Play()` so spoken replies stop clipping their opening syllables.
+- `2026-03-25` Added a backlog story for true streaming TTS playback, including provider-catalog and latency-measurement notes.
From dcddec2bdd9e35c981996017ec5c5097b9efadbb Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 16:56:09 +0000
Subject: [PATCH 43/83] Fix truncated voice input and MiniMax output
---
docs/VOICE-MODE.md | 1 +
.../Voice/VoiceCloudTextToSpeechClient.cs | 16 ++---
.../Services/Voice/VoiceService.cs | 62 +++++++++++++++++++
.../VoiceServiceTransportTests.cs | 24 +++++++
4 files changed, 92 insertions(+), 11 deletions(-)
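
Note on the hypothesis fallback: `SelectRecognizedText` promotes the latest hypothesis only when it is recent (within `HypothesisPromotionWindow`, 2 seconds), more than 3 characters longer than the final result, and ends with the final result. The sketch below restates those rules in a standalone form so the cases can be tried outside the tray app. The `SelectText` name and the tuple return are assumptions for illustration; the patch uses an `out bool` parameter and two timestamps instead of an age.

```csharp
#nullable enable
using System;

// Standalone restatement of the promotion rule; not the method from the patch.
static (string Text, bool Promoted) SelectText(string final, string? hypothesis, TimeSpan hypothesisAge)
{
    var window = TimeSpan.FromSeconds(2); // mirrors HypothesisPromotionWindow

    if (string.IsNullOrWhiteSpace(final) ||
        string.IsNullOrWhiteSpace(hypothesis) ||
        hypothesisAge > window)
    {
        return (final, false);
    }

    var normalizedFinal = final.Trim();
    var normalizedHypothesis = hypothesis.Trim();

    // Guard 1: the hypothesis must be meaningfully longer than the final text.
    if (normalizedHypothesis.Length <= normalizedFinal.Length + 3)
    {
        return (normalizedFinal, false);
    }

    // Guard 2: the final text must be the tail of the hypothesis.
    if (!normalizedHypothesis.EndsWith(normalizedFinal, StringComparison.OrdinalIgnoreCase))
    {
        return (normalizedFinal, false);
    }

    return (normalizedHypothesis, true);
}

var recovered = SelectText("again testing", "Now again testing", TimeSpan.FromSeconds(1));
var stale = SelectText("again testing", "Now again testing", TimeSpan.FromSeconds(3));
var unrelated = SelectText("again testing", "This is different", TimeSpan.FromSeconds(1));

Console.WriteLine($"{recovered.Promoted}: {recovered.Text}"); // True: Now again testing
Console.WriteLine($"{stale.Promoted}: {stale.Text}");         // False: again testing
Console.WriteLine($"{unrelated.Promoted}: {unrelated.Text}"); // False: again testing
```

The three calls correspond to the promoted, too-old, and non-matching rows of the `InlineData` table added to `VoiceServiceTransportTests.cs`.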
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index d21eea1..98750cc 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -770,3 +770,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Rebuilt the Windows speech recognizer after repeated deaf `UserCanceled` and watchdog-recycle failures instead of repeatedly restarting the same broken recognizer instance.
- `2026-03-25` Fixed the tray-chat draft mirror so it clears immediately after direct send, and primed media playback before `Play()` so spoken replies stop clipping their opening syllables.
- `2026-03-25` Added a backlog story for true streaming TTS playback, including provider-catalog and latency-measurement notes.
+- `2026-03-25` Corrected the MiniMax WebSocket request sequence by sending `task_finish` before reading audio, and added a guarded fallback that promotes a recent longer hypothesis when Windows only finalizes the tail of an utterance.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
index 859b6c8..ad36486 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
@@ -106,6 +106,11 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
var continueMessage = ApplyJsonTemplate(contract.ContinueMessageTemplate, templateValues);
await SendTextMessageAsync(socket, continueMessage);
+ if (!string.IsNullOrWhiteSpace(contract.FinishMessageTemplate))
+ {
+ await SendTextMessageAsync(socket, ApplyJsonTemplate(contract.FinishMessageTemplate, templateValues));
+ }
+
var audioBytes = new List<byte>();
long? firstChunkMs = null;
@@ -131,17 +136,6 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
}
}
- if (!string.IsNullOrWhiteSpace(contract.FinishMessageTemplate))
- {
- try
- {
- await SendTextMessageAsync(socket, ApplyJsonTemplate(contract.FinishMessageTemplate, templateValues));
- }
- catch
- {
- }
- }
-
try
{
await socket.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", CancellationToken.None);
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 767560b..636c3e7 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -32,6 +32,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private static readonly TimeSpan InitialRecognitionReadyDelay = TimeSpan.FromMilliseconds(500);
private static readonly TimeSpan RecognitionHealthCheckDelay = TimeSpan.FromSeconds(15);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
+ private static readonly TimeSpan HypothesisPromotionWindow = TimeSpan.FromSeconds(2);
private static readonly TimeSpan RecognitionResumeRetryDelay = TimeSpan.FromMilliseconds(500);
private static readonly TimeSpan QueuedReplyPlaybackGap = TimeSpan.FromMilliseconds(500);
@@ -60,6 +61,8 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private bool _quickPaused;
private string? _lastTranscript;
private DateTime _lastTranscriptUtc;
+ private string? _lastHypothesisText;
+ private DateTime _lastHypothesisUtc;
private readonly Queue<(string Text, string? SessionKey)> _pendingAssistantReplies = new();
private CancellationTokenSource? _playbackSkipCts;
private string? _currentReplyPreview;
@@ -720,6 +723,8 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
{
_recognitionActive = true;
_recognitionSessionHadActivity = false;
+ _lastHypothesisText = null;
+ _lastHypothesisUtc = default;
if (updateListeningStatus && _status.Running && !_awaitingReply && !_isSpeaking)
{
_status = BuildRunningStatus(
@@ -809,6 +814,8 @@ private async Task StopRecognitionSessionAsync()
_recognitionActive = false;
_recognitionHealthCheckArmed = false;
+ _lastHypothesisText = null;
+ _lastHypothesisUtc = default;
}
try
@@ -829,6 +836,7 @@ private async void OnSpeechResultGenerated(
{
var result = args.Result;
var text = result.Text?.Trim();
+ var promotedHypothesis = false;
if (string.IsNullOrWhiteSpace(text))
{
return;
@@ -842,6 +850,21 @@ private async void OnSpeechResultGenerated(
return;
}
+ lock (_gate)
+ {
+ text = SelectRecognizedText(
+ text,
+ _lastHypothesisText,
+ _lastHypothesisUtc,
+ DateTime.UtcNow,
+ out promotedHypothesis);
+ }
+
+ if (promotedHypothesis)
+ {
+ _logger.Info($"Voice recognition promoted recent hypothesis to recover truncated final result: {text}");
+ }
+
_logger.Info($"Voice recognition result ({result.Confidence}): {text}");
await HandleRecognizedTextAsync(text);
}
@@ -905,6 +928,8 @@ private void OnSpeechHypothesisGenerated(SpeechRecognizer sender, SpeechRecognit
sessionKey = GetCurrentVoiceSessionKey();
_recognitionSessionHadActivity = true;
_recognitionHealthCheckArmed = false;
+ _lastHypothesisText = text;
+ _lastHypothesisUtc = DateTime.UtcNow;
if (_status.State != VoiceRuntimeState.RecordingUtterance)
{
_status = BuildRunningStatus(
@@ -955,6 +980,8 @@ private async Task HandleRecognizedTextAsync(string text)
_lastTranscriptUtc = DateTime.UtcNow;
_recognitionSessionHadActivity = true;
_recognitionHealthCheckArmed = false;
+ _lastHypothesisText = null;
+ _lastHypothesisUtc = default;
cancellationToken = _runtimeCts.Token;
sessionKey = GetCurrentVoiceSessionKey();
}
@@ -1397,6 +1424,39 @@ internal static bool ShouldRebuildRecognitionAfterCompletion(
status == SpeechRecognitionResultStatus.TimeoutExceeded;
}
+ internal static string SelectRecognizedText(
+ string recognizedText,
+ string? latestHypothesisText,
+ DateTime latestHypothesisUtc,
+ DateTime utcNow,
+ out bool promotedHypothesis)
+ {
+ promotedHypothesis = false;
+
+ if (string.IsNullOrWhiteSpace(recognizedText) ||
+ string.IsNullOrWhiteSpace(latestHypothesisText) ||
+ utcNow - latestHypothesisUtc > HypothesisPromotionWindow)
+ {
+ return recognizedText;
+ }
+
+ var normalizedResult = recognizedText.Trim();
+ var normalizedHypothesis = latestHypothesisText.Trim();
+
+ if (normalizedHypothesis.Length <= normalizedResult.Length + 3)
+ {
+ return normalizedResult;
+ }
+
+ if (!normalizedHypothesis.EndsWith(normalizedResult, StringComparison.OrdinalIgnoreCase))
+ {
+ return normalizedResult;
+ }
+
+ promotedHypothesis = true;
+ return normalizedHypothesis;
+ }
+
private static string CreateReplyPreview(string text)
{
var trimmed = text.Trim();
@@ -1603,6 +1663,8 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_recognitionActive = false;
_recognitionSessionHadActivity = false;
_recognitionRestartInProgress = false;
+ _lastHypothesisText = null;
+ _lastHypothesisUtc = default;
synthesizer = _speechSynthesizer;
_speechSynthesizer = null;
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index 4d8bf6d..7a3ad03 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -177,6 +177,30 @@ public void ShouldRebuildRecognitionAfterCompletion_OnlyRebuildsForDeafCanceledS
Assert.Equal(expected, result);
}
+ [Theory]
+ [InlineData("Now again testing", "again testing", 1, true, "Now again testing")]
+ [InlineData("again testing", "again testing", 1, false, "again testing")]
+ [InlineData("Now again testing", "again testing", 3, false, "again testing")]
+ [InlineData("This is different", "again testing", 1, false, "again testing")]
+ public void SelectRecognizedText_PromotesRecentLongerHypothesisWhenFinalLooksTruncated(
+ string hypothesis,
+ string recognized,
+ int hypothesisAgeSeconds,
+ bool expectedPromoted,
+ string expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "SelectRecognizedText",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+ var now = new DateTime(2026, 3, 25, 16, 45, 30, DateTimeKind.Utc);
+ var args = new object?[] { recognized, hypothesis, now.AddSeconds(-hypothesisAgeSeconds), now, null };
+
+ var result = (string)method.Invoke(null, args)!;
+
+ Assert.Equal(expected, result);
+ Assert.Equal(expectedPromoted, (bool)args[4]!);
+ }
+
private static MethodInfo GetMethod()
{
return typeof(VoiceService).GetMethod(
From cbb4fdcb5a536cae7305d3ccb2dc531c09c862ff Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 17:10:54 +0000
Subject: [PATCH 44/83] Refresh talk mode on default mic changes
---
docs/VOICE-MODE.md | 1 +
.../Services/Voice/VoiceService.cs | 78 +++++++++++++++++++
.../VoiceServiceTransportTests.cs | 24 ++++++
3 files changed, 103 insertions(+)
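
Note on the refresh decision: the handler makes two decisions under the lock. It first decides whether the recognizer should be rebuilt at all, then whether listening should be stopped and resumed around the rebuild. The sketch below shows how the two conditions nest. It is a simplification: plain bools stand in for `mode == VoiceActivationMode.TalkMode` and `role == AudioDeviceRole.Default`, and the `DecideMicRefresh` name is hypothetical.

```csharp
#nullable enable
using System;

// Simplified restatement of the two decisions made in
// OnDefaultAudioCaptureDeviceChanged; not the code from the patch.
static (bool Refresh, bool RestartListening) DecideMicRefresh(
    bool running,
    bool isTalkMode,
    string? configuredInputDeviceId,
    bool isDefaultRole,
    bool recognitionActive,
    bool awaitingReply,
    bool isSpeaking)
{
    // Rebuild only when Talk Mode is running on the system default microphone.
    var refresh = running &&
                  isTalkMode &&
                  string.IsNullOrWhiteSpace(configuredInputDeviceId) &&
                  isDefaultRole;

    // Stop and resume listening only if a session is live and nothing else owns the turn.
    var restartListening = refresh && recognitionActive && !awaitingReply && !isSpeaking;

    return (refresh, restartListening);
}

var idle = DecideMicRefresh(true, true, null, true, true, false, false);
var speaking = DecideMicRefresh(true, true, null, true, true, false, true);
var pinned = DecideMicRefresh(true, true, "device-1", true, true, false, false);

Console.WriteLine($"{idle.Refresh} {idle.RestartListening}");         // True True
Console.WriteLine($"{speaking.Refresh} {speaking.RestartListening}"); // True False
Console.WriteLine($"{pinned.Refresh} {pinned.RestartListening}");     // False False
```

The middle case is the one worth noting: while a reply is being spoken the recognizer is still rebuilt, but the handler does not stop or resume the session itself.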
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 98750cc..bc2c1c8 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -771,3 +771,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Fixed the tray-chat draft mirror so it clears immediately after direct send, and primed media playback before `Play()` so spoken replies stop clipping their opening syllables.
- `2026-03-25` Added a backlog story for true streaming TTS playback, including provider-catalog and latency-measurement notes.
- `2026-03-25` Corrected the MiniMax WebSocket request sequence by sending `task_finish` before reading audio, and added a guarded fallback that promotes a recent longer hypothesis when Windows only finalizes the tail of an utterance.
+- `2026-03-25` Added live default-microphone change handling for Talk Mode, so using the system default capture device now refreshes the recognizer when Windows switches to a new default mic such as AirPods.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 636c3e7..649aa05 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -80,6 +80,7 @@ public VoiceService(IOpenClawLogger logger, SettingsManager settings)
_cloudTextToSpeechClient = new VoiceCloudTextToSpeechClient();
_status = new VoiceStatusInfo();
_status = BuildStoppedStatus(null, null);
+ MediaDevice.DefaultAudioCaptureDeviceChanged += OnDefaultAudioCaptureDeviceChanged;
}
public VoiceStatusInfo CurrentStatus
@@ -392,6 +393,7 @@ public void Dispose()
}
_disposed = true;
+ MediaDevice.DefaultAudioCaptureDeviceChanged -= OnDefaultAudioCaptureDeviceChanged;
try
{
Task.Run(() => StopRuntimeResourcesAsync(updateStoppedStatus: true)).GetAwaiter().GetResult();
@@ -1750,6 +1752,18 @@ private static bool IsMainSessionKey(string sessionKey)
return sessionKey == DefaultSessionKey || sessionKey.Contains(":main:", StringComparison.Ordinal);
}
+ internal static bool ShouldRefreshRecognitionForDefaultCaptureDeviceChange(
+ bool running,
+ VoiceActivationMode mode,
+ string? configuredInputDeviceId,
+ AudioDeviceRole role)
+ {
+ return running &&
+ mode == VoiceActivationMode.TalkMode &&
+ string.IsNullOrWhiteSpace(configuredInputDeviceId) &&
+ role == AudioDeviceRole.Default;
+ }
+
private static string PrepareReplyForSpeech(string text)
{
var trimmed = text.Trim();
@@ -1816,6 +1830,70 @@ private VoiceStatusInfo BuildRunningStatus(
};
}
+ private async void OnDefaultAudioCaptureDeviceChanged(object sender, DefaultAudioCaptureDeviceChangedEventArgs args)
+ {
+ try
+ {
+ CancellationToken token;
+ bool shouldRefresh;
+ bool shouldRestartListening;
+ string? newDeviceId = args.Id;
+
+ lock (_gate)
+ {
+ if (_runtimeCts == null || _runtimeCts.IsCancellationRequested)
+ {
+ return;
+ }
+
+ shouldRefresh = ShouldRefreshRecognitionForDefaultCaptureDeviceChange(
+ _status.Running,
+ _status.Mode,
+ _settings.Voice.InputDeviceId,
+ args.Role);
+ shouldRestartListening = shouldRefresh && _recognitionActive && !_awaitingReply && !_isSpeaking;
+ token = _runtimeCts.Token;
+
+ if (shouldRefresh)
+ {
+ _recognitionRestartInProgress = shouldRestartListening;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.Arming,
+ "Microphone device changed; refreshing speech recognition.");
+ }
+ }
+
+ if (!shouldRefresh)
+ {
+ return;
+ }
+
+ _logger.Info(
+ $"Default capture device changed to {newDeviceId ?? "(unknown)"}; refreshing TalkMode recognizer");
+
+ if (shouldRestartListening)
+ {
+ await StopRecognitionSessionAsync();
+ }
+
+ await RebuildSpeechRecognizerAsync("default capture device changed", token);
+
+ if (shouldRestartListening && !token.IsCancellationRequested)
+ {
+ await ResumeRecognitionSessionAsync(token, "default capture device changed");
+ }
+ }
+ catch (OperationCanceledException)
+ {
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Default capture device refresh failed: {ex.Message}");
+ }
+ }
+
private void ArmRecognitionHealthCheck()
{
lock (_gate)
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index 7a3ad03..9052638 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -1,6 +1,7 @@
using System.Reflection;
using OpenClaw.Shared;
using OpenClawTray.Services.Voice;
+using Windows.Media.Devices;
using Windows.Media.SpeechRecognition;
namespace OpenClaw.Tray.Tests;
@@ -201,6 +202,29 @@ public void SelectRecognizedText_PromotesRecentLongerHypothesisWhenFinalLooksTru
Assert.Equal(expectedPromoted, (bool)args[4]!);
}
+ [Theory]
+ [InlineData(true, VoiceActivationMode.TalkMode, null, AudioDeviceRole.Default, true)]
+ [InlineData(true, VoiceActivationMode.TalkMode, "", AudioDeviceRole.Default, true)]
+ [InlineData(true, VoiceActivationMode.TalkMode, "device-1", AudioDeviceRole.Default, false)]
+ [InlineData(true, VoiceActivationMode.VoiceWake, null, AudioDeviceRole.Default, false)]
+ [InlineData(false, VoiceActivationMode.TalkMode, null, AudioDeviceRole.Default, false)]
+ [InlineData(true, VoiceActivationMode.TalkMode, null, AudioDeviceRole.Communications, false)]
+ public void ShouldRefreshRecognitionForDefaultCaptureDeviceChange_OnlyRefreshesTalkModeUsingSystemDefaultMic(
+ bool running,
+ VoiceActivationMode mode,
+ string? configuredInputDeviceId,
+ AudioDeviceRole role,
+ bool expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "ShouldRefreshRecognitionForDefaultCaptureDeviceChange",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (bool)method.Invoke(null, [running, mode, configuredInputDeviceId, role])!;
+
+ Assert.Equal(expected, result);
+ }
+
private static MethodInfo GetMethod()
{
return typeof(VoiceService).GetMethod(
From b7ea999edd4d066646d2332e1c7c8275f30b864b Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 18:01:24 +0000
Subject: [PATCH 45/83] Support selected playback devices
---
docs/VOICE-MODE.md | 8 ++++-
.../Services/Voice/VoiceService.cs | 36 ++++++++++++++++---
2 files changed, 38 insertions(+), 6 deletions(-)
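
Note on output device selection: `ConfigurePlaybackOutputDeviceAsync` looks up the configured id among the enumerated render devices with an ordinal (case-sensitive) comparison and leaves the player on the system default when the id is empty or not found. The sketch below shows that lookup and fallback over an in-memory list. The `ResolveOutput` name, the tuple list, and the device ids are assumptions for illustration; the patch enumerates real devices through `DeviceInformation.FindAllAsync`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// In-memory restatement of the lookup and fallback; not the method from the patch.
static string ResolveOutput(IReadOnlyList<(string Id, string Name)> devices, string? configuredId)
{
    const string SystemDefault = "(system default)";

    // No configured device: the patch returns early and keeps the default speaker.
    if (string.IsNullOrWhiteSpace(configuredId))
    {
        return SystemDefault;
    }

    foreach (var device in devices)
    {
        // Ordinal comparison, as in the patch: ids must match exactly, including case.
        if (string.Equals(device.Id, configuredId, StringComparison.Ordinal))
        {
            return device.Name;
        }
    }

    // Not found: the patch logs a warning and leaves player.AudioDevice unset.
    return SystemDefault;
}

var devices = new (string Id, string Name)[]
{
    ("spk-1", "Speakers"),
    ("hp-2", "Headphones"),
};

Console.WriteLine(ResolveOutput(devices, "hp-2"));    // Headphones
Console.WriteLine(ResolveOutput(devices, "HP-2"));    // (system default)
Console.WriteLine(ResolveOutput(devices, null));      // (system default)
```

Because the comparison is ordinal, a saved id that differs only in case falls back to the default speaker rather than matching.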
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index bc2c1c8..93ce8e6 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -490,7 +490,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
| `Enabled` | Global feature kill-switch independent of mode |
| `SpeechToTextProviderId` | Selected STT provider id from the local provider catalog |
| `TextToSpeechProviderId` | Selected TTS provider id from the local provider catalog |
-| `InputDeviceId` / `OutputDeviceId` | Stable audio device binding |
+| `InputDeviceId` / `OutputDeviceId` | Preferred audio device binding, with selected-speaker support implemented first |
| `SampleRateHz` | Shared capture sample rate, fixed to a speech-friendly default |
| `CaptureChunkMs` | Frame size for capture, VAD, and wakeword processing |
| `BargeInEnabled` | Allows microphone capture while audio playback is active |
@@ -519,6 +519,11 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
| `Voice.VoiceWake.EndSilenceMs` | int | `900` | voice wake | Silence timeout used to finalize the post-trigger utterance |
| `Voice.TalkMode.MinSpeechMs` | int | `250` | talk mode | Minimum detected speech duration before an utterance is treated as real input |
| `Voice.TalkMode.EndSilenceMs` | int | `900` | talk mode | Silence timeout used to finalize an utterance |
+
+Current status:
+
+- `Voice.OutputDeviceId` is now applied to Talk Mode playback through `MediaPlayer.AudioDevice`
+- `Voice.InputDeviceId` is still persisted and shown in settings, but explicit non-default microphone binding is not implemented yet
| `Voice.TalkMode.MaxUtteranceMs` | int | `15000` | talk mode | Hard cap on utterance length before forced submission/finalization |
| `VoiceProviderConfiguration.Providers[].ProviderId` | string | none | cloud providers | Provider id matching an `Assets\\voice-providers.json` entry |
| `VoiceProviderConfiguration.Providers[].Values["apiKey"]` | string? | `null` | cloud providers | API key sent using the provider contract's configured auth header |
@@ -772,3 +777,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Added a backlog story for true streaming TTS playback, including provider-catalog and latency-measurement notes.
- `2026-03-25` Corrected the MiniMax WebSocket request sequence by sending `task_finish` before reading audio, and added a guarded fallback that promotes a recent longer hypothesis when Windows only finalizes the tail of an utterance.
- `2026-03-25` Added live default-microphone change handling for Talk Mode, so using the system default capture device now refreshes the recognizer when Windows switches to a new default mic such as AirPods.
+- `2026-03-25` Applied `Voice.OutputDeviceId` to Talk Mode playback via `MediaPlayer.AudioDevice`, so selected non-default speaker devices now work even though explicit non-default microphone capture is still pending.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 649aa05..5d107b2 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -517,17 +517,13 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
recognizer = await CreateSpeechRecognizerAsync(settings);
synthesizer = new SpeechSynthesizer();
player = new MediaPlayer();
+ await ConfigurePlaybackOutputDeviceAsync(player, settings);
if (!string.IsNullOrWhiteSpace(settings.InputDeviceId))
{
_logger.Warn("Selected input device is saved, but Talk Mode currently uses the system speech input device.");
}
- if (!string.IsNullOrWhiteSpace(settings.OutputDeviceId))
- {
- _logger.Warn("Selected output device is saved, but Talk Mode currently uses the default speech output device.");
- }
-
recognizer.HypothesisGenerated += OnSpeechHypothesisGenerated;
recognizer.ContinuousRecognitionSession.ResultGenerated += OnSpeechResultGenerated;
recognizer.ContinuousRecognitionSession.Completed += OnSpeechRecognitionCompleted;
@@ -616,6 +612,36 @@ private async Task<SpeechRecognizer> CreateSpeechRecognizerAsync(VoiceSettings s
return recognizer;
}
+ private async Task ConfigurePlaybackOutputDeviceAsync(MediaPlayer player, VoiceSettings settings)
+ {
+ if (string.IsNullOrWhiteSpace(settings.OutputDeviceId))
+ {
+ return;
+ }
+
+ try
+ {
+ var renderSelector = MediaDevice.GetAudioRenderSelector();
+ var renderDevices = await DeviceInformation.FindAllAsync(renderSelector);
+ var selectedRenderDevice = renderDevices.FirstOrDefault(device =>
+ string.Equals(device.Id, settings.OutputDeviceId, StringComparison.Ordinal));
+
+ if (selectedRenderDevice == null)
+ {
+ _logger.Warn(
+ $"Selected output device '{settings.OutputDeviceId}' was not found; falling back to the system default speaker.");
+ return;
+ }
+
+ player.AudioDevice = selectedRenderDevice;
+ _logger.Info($"Voice playback output device set to {selectedRenderDevice.Name}");
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Failed to configure selected output device: {ex.Message}");
+ }
+ }
+
private async Task EnsureMicrophoneConsentAsync()
{
if (!PackageHelper.IsPackaged)
From 0709911c25e9d3620af73b774b135e49a40579f5 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 18:21:38 +0000
Subject: [PATCH 46/83] Avoid stale voice preview replies
---
docs/VOICE-MODE.md | 1 +
src/OpenClaw.Shared/OpenClawGatewayClient.cs | 62 ++++++++++++++++---
.../OpenClawGatewayClientTests.cs | 50 +++++++++++++--
3 files changed, 101 insertions(+), 12 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 93ce8e6..5c541fe 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -778,3 +778,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Corrected the MiniMax WebSocket request sequence by sending `task_finish` before reading audio, and added a guarded fallback that promotes a recent longer hypothesis when Windows only finalizes the tail of an utterance.
- `2026-03-25` Added live default-microphone change handling for Talk Mode, so using the system default capture device now refreshes the recognizer when Windows switches to a new default mic such as AirPods.
- `2026-03-25` Applied `Voice.OutputDeviceId` to Talk Mode playback via `MediaPlayer.AudioDevice`, so selected non-default speaker devices now work even though explicit non-default microphone capture is still pending.
+- `2026-03-25` Hardened the gateway preview fallback so a bare final chat event does not replay the previous assistant reply when `sessions.preview` lags behind the real session update.
diff --git a/src/OpenClaw.Shared/OpenClawGatewayClient.cs b/src/OpenClaw.Shared/OpenClawGatewayClient.cs
index 45ce080..25710e7 100644
--- a/src/OpenClaw.Shared/OpenClawGatewayClient.cs
+++ b/src/OpenClaw.Shared/OpenClawGatewayClient.cs
@@ -15,7 +15,8 @@ public class OpenClawGatewayClient : WebSocketClientBase
private GatewayUsageStatusInfo? _usageStatus;
private GatewayCostUsageInfo? _usageCost;
private readonly Dictionary<string, string> _pendingRequestMethods = new();
- private readonly HashSet<string> _pendingChatPreviewSessionKeys = new();
+ private readonly Dictionary<string, PendingChatPreviewState> _pendingChatPreviewSessionKeys = new();
+ private readonly Dictionary<string, string> _lastAssistantMessagesBySession = new();
private readonly object _pendingRequestLock = new();
private readonly object _pendingChatPreviewLock = new();
private readonly object _sessionsLock = new();
@@ -27,6 +28,11 @@ public class OpenClawGatewayClient : WebSocketClientBase
private string _defaultChatSessionKey = DefaultChatSessionKey;
private const string DefaultChatSessionKey = "main";
+ private sealed class PendingChatPreviewState
+ {
+ public string? LastKnownAssistantText { get; init; }
+ public int AttemptCount { get; set; }
+ }
private void ResetUnsupportedMethodFlags()
{
@@ -880,6 +886,14 @@ private void HandleChatEvent(JsonElement root)
private void EmitChatMessage(string sessionKey, string role, string text, bool isFinal)
{
+ if (isFinal && string.Equals(role, "assistant", StringComparison.OrdinalIgnoreCase))
+ {
+ lock (_pendingChatPreviewLock)
+ {
+ _lastAssistantMessagesBySession[NormalizeChatSessionKey(sessionKey)] = text;
+ }
+ }
+
ChatMessageReceived?.Invoke(this, new ChatMessageEventArgs
{
SessionKey = sessionKey,
@@ -1199,18 +1213,36 @@ private void RequestChatPreviewForFinalState(string sessionKey)
}
var normalizedSessionKey = NormalizeChatSessionKey(sessionKey);
+ string? lastKnownAssistantText;
lock (_pendingChatPreviewLock)
{
- if (!_pendingChatPreviewSessionKeys.Add(normalizedSessionKey))
+ if (_pendingChatPreviewSessionKeys.ContainsKey(normalizedSessionKey))
{
return;
}
+
+ _lastAssistantMessagesBySession.TryGetValue(normalizedSessionKey, out lastKnownAssistantText);
+ _pendingChatPreviewSessionKeys[normalizedSessionKey] = new PendingChatPreviewState
+ {
+ LastKnownAssistantText = lastKnownAssistantText,
+ AttemptCount = 0
+ };
}
+ RequestChatPreviewForFinalStateAsync(normalizedSessionKey, delayMs: 0);
+ }
+
+ private void RequestChatPreviewForFinalStateAsync(string normalizedSessionKey, int delayMs)
+ {
_ = Task.Run(async () =>
{
try
{
+ if (delayMs > 0)
+ {
+ await Task.Delay(delayMs);
+ }
+
await RequestSessionPreviewAsync([normalizedSessionKey], limit: 2, maxChars: 4000);
}
catch (Exception ex)
@@ -1229,17 +1261,14 @@ private void EmitPendingChatPreviewMessages(SessionsPreviewPayloadInfo payload)
foreach (var preview in payload.Previews)
{
var normalizedSessionKey = NormalizeChatSessionKey(preview.Key);
- var shouldEmit = false;
+ PendingChatPreviewState? pendingState = null;
lock (_pendingChatPreviewLock)
{
- if (_pendingChatPreviewSessionKeys.Remove(normalizedSessionKey))
- {
- shouldEmit = true;
- }
+ _pendingChatPreviewSessionKeys.TryGetValue(normalizedSessionKey, out pendingState);
}
- if (!shouldEmit)
+ if (pendingState == null)
{
continue;
}
@@ -1254,6 +1283,23 @@ private void EmitPendingChatPreviewMessages(SessionsPreviewPayloadInfo payload)
continue;
}
+ if (string.Equals(assistantText, pendingState.LastKnownAssistantText, StringComparison.Ordinal))
+ {
+ if (pendingState.AttemptCount < 3) // once the retry budget is spent, fall through and emit the preview text as-is
+ {
+ pendingState.AttemptCount++;
+ _logger.Warn(
+ $"sessions.preview returned the previous assistant reply for {normalizedSessionKey}; retrying preview ({pendingState.AttemptCount}/3)");
+ RequestChatPreviewForFinalStateAsync(normalizedSessionKey, delayMs: 400 * pendingState.AttemptCount); // linear backoff: 400, 800, 1200 ms
+ continue;
+ }
+ }
+
+ lock (_pendingChatPreviewLock)
+ {
+ _pendingChatPreviewSessionKeys.Remove(normalizedSessionKey);
+ }
+
_logger.Info($"Assistant response (preview): {assistantText.Substring(0, Math.Min(100, assistantText.Length))}...");
EmitChatMessage(normalizedSessionKey, "assistant", assistantText, isFinal: true);
EmitChatNotification(assistantText);
diff --git a/tests/OpenClaw.Shared.Tests/OpenClawGatewayClientTests.cs b/tests/OpenClaw.Shared.Tests/OpenClawGatewayClientTests.cs
index d43f743..87f162b 100644
--- a/tests/OpenClaw.Shared.Tests/OpenClawGatewayClientTests.cs
+++ b/tests/OpenClaw.Shared.Tests/OpenClawGatewayClientTests.cs
@@ -1,6 +1,8 @@
using System;
+using System.Collections;
using System.Collections.Generic;
using System.Linq;
+using System.Reflection;
using System.Text.Json;
using Xunit;
using OpenClaw.Shared;
@@ -194,14 +196,26 @@ public SessionsPreviewPayloadInfo ParseSessionsPreviewPayload(string payloadJson
public int GetPendingChatPreviewSessionCount()
{
- var pending = GetPrivateField<HashSet<string>>("_pendingChatPreviewSessionKeys");
+ var pending = GetPrivateField<IDictionary>("_pendingChatPreviewSessionKeys");
return pending.Count;
}
- public void AddPendingChatPreviewSession(string sessionKey)
+ public void AddPendingChatPreviewSession(string sessionKey, string? lastKnownAssistantText = null, int attemptCount = 0)
{
- var pending = GetPrivateField<HashSet<string>>("_pendingChatPreviewSessionKeys");
- pending.Add(sessionKey);
+ var pending = GetPrivateField<IDictionary>("_pendingChatPreviewSessionKeys");
+ var stateType = typeof(OpenClawGatewayClient).GetNestedType(
+ "PendingChatPreviewState",
+ BindingFlags.NonPublic)!;
+ var state = Activator.CreateInstance(stateType)!;
+ stateType.GetProperty("LastKnownAssistantText")!.SetValue(state, lastKnownAssistantText);
+ stateType.GetProperty("AttemptCount")!.SetValue(state, attemptCount);
+ pending[sessionKey] = state;
+ }
+
+ public void SetLastAssistantMessage(string sessionKey, string text)
+ {
+ var lastMessages = GetPrivateField<IDictionary>("_lastAssistantMessagesBySession");
+ lastMessages[sessionKey] = text;
}
public ChatMessageEventArgs? ParseSessionsPreviewPayloadAndCaptureMessage(string payloadJson)
@@ -885,6 +899,34 @@ public void ParseSessionsPreview_EmitsAssistantMessage_ForQueuedFinalPreview()
Assert.Equal(0, helper.GetPendingChatPreviewSessionCount());
}
+ [Fact]
+ public void ParseSessionsPreview_DoesNotEmitStaleAssistantMessage_ForQueuedFinalPreview()
+ {
+ var helper = new GatewayClientTestHelper();
+ helper.SetUnsupportedMethodFlags(usageStatus: false, usageCost: false, sessionPreview: true, nodeList: false);
+ helper.SetLastAssistantMessage("main", "world");
+ helper.AddPendingChatPreviewSession("main", lastKnownAssistantText: "world");
+
+ var captured = helper.ParseSessionsPreviewPayloadAndCaptureMessage("""
+ {
+ "ts": 1739760000000,
+ "previews": [
+ {
+ "key": "agent:main:main",
+ "status": "ok",
+ "items": [
+ { "role": "user", "text": "hello again" },
+ { "role": "assistant", "text": "world" }
+ ]
+ }
+ ]
+ }
+ """);
+
+ Assert.Null(captured); // the stale "world" reply must not be re-emitted
+ Assert.Equal(1, helper.GetPendingChatPreviewSessionCount()); // session stays pending for the retry
+ }
+
[Fact]
public void SerializeConnectRequest_UsesCliClientModeAndOperatorScopes()
{
From 13174bab119c005048bd08b508ba373c80bfec15 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 18:26:56 +0000
Subject: [PATCH 47/83] Document AudioGraph voice input architecture
---
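Reviewer note (not part of the commit): the "Proposed STT Adapter Contract" section added by this patch lists member names but no concrete types. One possible C# shape is sketched below. The method and event names come from the document; everything else is an assumption made for illustration, including the `Task` return types, the fields on `SpeechToTextStartArgs`, `FrameMetadata`, and `UtteranceMetadata`, and both event-args classes.

```csharp
using System;
using System.Threading.Tasks;

// Placeholder payloads: the document names these types but does not define their fields.
public sealed record SpeechToTextStartArgs(string? InputDeviceId, int SampleRateHz);
public sealed record FrameMetadata(long Sequence, TimeSpan Duration);
public sealed record UtteranceMetadata(TimeSpan Duration);

// Hypothetical event payload carrying interim or final transcript text.
public sealed class TranscriptEventArgs : EventArgs
{
    public TranscriptEventArgs(string text) => Text = text;
    public string Text { get; }
}

// Hypothetical event payload describing a recognition failure.
public sealed class RecognitionFaultedEventArgs : EventArgs
{
    public RecognitionFaultedEventArgs(Exception error) => Error = error;
    public Exception Error { get; }
}

public interface ISpeechToTextAdapter
{
    Task StartAsync(SpeechToTextStartArgs args);
    Task StopAsync();

    // Streaming-capable adapters consume live PCM16 frames.
    Task PushFramesAsync(ReadOnlyMemory<byte> pcm16, FrameMetadata metadata);

    // Utterance-based adapters consume one bounded utterance at a time.
    Task SubmitUtteranceAsync(ReadOnlyMemory<byte> pcm16, UtteranceMetadata metadata);

    event EventHandler<TranscriptEventArgs>? InterimTranscriptReceived;
    event EventHandler<TranscriptEventArgs>? FinalTranscriptReceived;
    event EventHandler<RecognitionFaultedEventArgs>? RecognitionFaulted;
}
```

Keeping both `PushFramesAsync` and `SubmitUtteranceAsync` on one interface follows the document's list. A real design might instead split them into two interfaces so that an adapter only implements the mode it supports.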
docs/VOICE-MODE.md | 241 +++++++++++++++++++++++++++++++--------------
1 file changed, 168 insertions(+), 73 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 5c541fe..42e4834 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -407,6 +407,9 @@ The voice subsystem is introduced as a new node capability category: `voice`.
| `voice.status.get` | Return runtime voice status | none | `VoiceStatusInfo` |
| `voice.start` | Start the voice runtime with the supplied or persisted mode | `VoiceStartArgs` | `VoiceStatusInfo` |
| `voice.stop` | Stop the voice runtime | `VoiceStopArgs` | `VoiceStatusInfo` |
+| `voice.pause` | Pause the active voice runtime | `VoicePauseArgs` | `VoiceStatusInfo` |
+| `voice.resume` | Resume a paused voice runtime | `VoiceResumeArgs` | `VoiceStatusInfo` |
+| `voice.skip` | Skip the currently spoken reply and advance the queue if another reply is pending | `VoiceSkipArgs` | `VoiceStatusInfo` |
### Payload Types
@@ -417,6 +420,9 @@ The voice subsystem is introduced as a new node capability category: `voice`.
- `VoiceStatusInfo`
- `VoiceStartArgs`
- `VoiceStopArgs`
+- `VoicePauseArgs`
+- `VoiceResumeArgs`
+- `VoiceSkipArgs`
- `VoiceSettingsUpdateArgs`
These contracts are defined in [VoiceModeSchema.cs](../src/OpenClaw.Shared/VoiceModeSchema.cs).
@@ -519,11 +525,6 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
| `Voice.VoiceWake.EndSilenceMs` | int | `900` | voice wake | Silence timeout used to finalize the post-trigger utterance |
| `Voice.TalkMode.MinSpeechMs` | int | `250` | talk mode | Minimum detected speech duration before an utterance is treated as real input |
| `Voice.TalkMode.EndSilenceMs` | int | `900` | talk mode | Silence timeout used to finalize an utterance |
-
-Current status:
-
-- `Voice.OutputDeviceId` is now applied to Talk Mode playback through `MediaPlayer.AudioDevice`
-- `Voice.InputDeviceId` is still persisted and shown in settings, but explicit non-default microphone binding is not implemented yet
| `Voice.TalkMode.MaxUtteranceMs` | int | `15000` | talk mode | Hard cap on utterance length before forced submission/finalization |
| `VoiceProviderConfiguration.Providers[].ProviderId` | string | none | cloud providers | Provider id matching an `Assets\\voice-providers.json` entry |
| `VoiceProviderConfiguration.Providers[].Values["apiKey"]` | string? | `null` | cloud providers | API key sent using the provider contract's configured auth header |
@@ -531,80 +532,160 @@ Current status:
| `VoiceProviderConfiguration.Providers[].Values["voiceId"]` | string? | provider default | cloud providers | Voice id inserted into the configured request template or URL |
| `VoiceProviderConfiguration.Providers[].Values["voiceSettingsJson"]` | string? | provider default | cloud providers | Raw JSON fragment inserted into the configured request template; may be a keyed fragment like `"voice_setting": { ... }` |
-At runtime today, those device ids are persisted and surfaced in the UI, but the v1 `TalkMode` path still uses the Windows system speech stack defaults for capture and playback.
+At runtime today:
+
+- `Voice.OutputDeviceId` is applied to Talk Mode playback through `MediaPlayer.AudioDevice`
+- `Voice.InputDeviceId` is still persisted and shown in the UI, but Talk Mode capture still uses the Windows default speech input path
+- when the system default capture device changes and Talk Mode is using the default mic, the recognizer is rebuilt so device switches such as AirPods are picked up without a full app restart
+- explicit non-default microphone binding is still pending the planned `AudioGraph` capture refactor
+
+## Current Runtime Architecture
+
+The current Windows implementation is still centred on `VoiceService`, with a few supporting seams around it:
-## Component Architecture
+- `VoiceCapability`
+ exposes shared `voice.*` commands to the node/gateway surface
+- `VoiceService`
+ owns Talk Mode runtime state, Windows STT/TTS integration, reply queuing, timeouts, and gateway reply handling
+- `VoiceChatCoordinator`
+ mirrors interim transcript drafts and conversation turns into the tray UI without making the chat window part of the transport path
+- `OpenClawGatewayClient`
+ carries direct `chat.send`, final chat events, and the `sessions.preview` fallback path for bare final markers
+- `WebChatWindow`
+ mirrors live transcript drafts locally and optionally strips injected `<relevant-memories>` blocks from rendered chat text
+
+### Current End-to-End Talk Mode
```mermaid
flowchart LR
- A["NodeService<br/>control + lifecycle"] --> B["VoiceCapability<br/>command surface"]
- B --> C["VoiceCoordinator<br/>runtime state machine"]
- C --> D["SpeechRecognizer<br/>Windows continuous dictation"]
- C --> E["VoiceWakeService<br/>NanoWakeWord scores"]
- C --> F["VoiceActivityDetector<br/>speech/silence segments"]
- C --> G["VoiceTransport<br/>operator sidecar + chat.send exchange"]
- C --> H["SpeechSynthesizer + MediaPlayer<br/>reply playback"]
- B --> I["SettingsManager / SettingsData.Voice<br/>persisted config JSON"]
+ A["User speech"] --> B["Windows SpeechRecognizer<br/>continuous dictation on current default mic"]
+ B --> C["HypothesisGenerated<br/>interim text"]
+ C --> D["VoiceService<br/>draft event"]
+ D --> E["VoiceChatCoordinator"]
+ E --> F["WebChatWindow<br/>local compose-box mirror only"]
+
+ B --> G["ResultGenerated<br/>final Medium/High text"]
+ G --> H["VoiceService<br/>duplicate guard + late hypothesis promotion"]
+ H --> I["Stop recognition session"]
+ I --> J["OpenClawGatewayClient.SendChatMessageAsync<br/>direct chat.send(main, transcript)"]
+ J --> K["OpenClaw / session pipeline"]
+ K --> L["Chat final event"]
+ L --> M{"assistant text present?"}
+ M -- "yes" --> N["assistant text"]
+ M -- "no" --> O["sessions.preview fallback<br/>with stale-preview retry guard"]
+ O --> N
+
+ N --> P["VoiceService reply queue"]
+ P --> Q{"TTS provider"}
+ Q -- "windows" --> R["SpeechSynthesizer"]
+ Q -- "cloud" --> S["VoiceCloudTextToSpeechClient<br/>MiniMax websocket or other provider"]
+ R --> T["Complete playable stream"]
+ S --> T
+ T --> U["MediaPlayer<br/>selected OutputDeviceId if set"]
+ U --> V["Speaker / headset output"]
+ V --> W["Resume recognition when queue drains"]
```
-## Runtime Data Flow
+### Current Processing Stages
-### Voice Wake Mode
+| Stage | Component | Input | Output |
+|---|---|---|---|
+| 1 | `SpeechRecognizer` | Windows default speech-input path | interim/final transcript text |
+| 2 | `VoiceService` | final transcript text | de-duplicated transcript + runtime state changes |
+| 3 | `VoiceChatCoordinator` | interim/final draft events | mirrored tray chat compose text |
+| 4 | `OpenClawGatewayClient` | transcript text + session key | `chat.send` request + assistant reply events |
+| 5 | `OpenClawGatewayClient` preview fallback | bare final chat marker | assistant preview text, guarded against stale replay |
+| 6 | `VoiceService` reply queue | assistant reply text | ordered reply playback work |
+| 7 | `VoiceCloudTextToSpeechClient` / `SpeechSynthesizer` | assistant reply text | complete playable audio stream |
+| 8 | `MediaPlayer` | complete playable audio stream | rendered audio on default or selected speaker |
-```mermaid
-flowchart TD
- A["Microphone device<br/>float/PCM hardware frames"] --> B["AudioCaptureService<br/>PCM16 mono 16kHz chunks"]
- B --> C["Ring Buffer<br/>bounded pre-roll PCM16 frames"]
-B --> D["VoiceWakeService (NanoWakeWord)<br/>wake score per chunk"]
- D --> E{"score >= trigger threshold?"}
- E -- "no" --> B
-E -- "yes" --> F["VoiceCoordinator<br/>VoiceWakeDetected(session state change)"]
- F --> G["UtteranceAssembler<br/>seed with pre-roll PCM16 from Ring Buffer"]
- C --> G
- B --> G
- G --> H["VoiceActivityDetector<br/>speech/silence state from PCM16 chunks"]
- H --> I{"speech still active?"}
- I -- "yes" --> B
- I -- "no, end silence reached" --> J["Finalize utterance<br/>PCM16 buffer + timing metadata"]
- J --> K["SpeechRecognizer<br/>utterance PCM16 -> transcript text"]
- K --> L["VoiceTransport<br/>chat.send(main, transcript)"]
- L --> M["OpenClaw conversation pipeline<br/>assistant reply text"]
- M --> N["AudioPlaybackService<br/>TTS output bytes / decoded PCM"]
- N --> O["Speaker device<br/>rendered audio"]
- O --> P{"barge-in enabled?"}
- P -- "yes" --> B
- P -- "no, playback complete" --> B
-```
+## Planned AudioGraph Input Architecture
+
+The next input-phase refactor should move microphone ownership away from `SpeechRecognizer` and into an explicit capture pipeline built around `AudioGraph`.
+
+The purpose of that change is to unlock:
+
+- true selected non-default microphone support
+- streaming rather than utterance-owned capture
+- a proper ring buffer and VAD pipeline
+- future non-Windows and streaming STT providers
+- future barge-in / duplex work
-### Always-On Mode
+### Target Input Stack
```mermaid
flowchart TD
- A["Windows speech input<br/>default microphone path"] --> B["SpeechRecognizer<br/>continuous dictation result text"]
- B --> C{"final recognized text?"}
- C -- "no" --> A
- C -- "yes" --> D["VoiceCoordinator<br/>pause listening and mark AwaitingResponse"]
- D --> E["VoiceTransport<br/>chat.send(main, transcript)"]
- E --> F["OpenClaw conversation pipeline<br/>assistant reply text"]
- F --> G["SpeechSynthesizer<br/>assistant text -> audio stream"]
- G --> H["MediaPlayer<br/>reply playback"]
- H --> I["Windows audio output<br/>default speaker path"]
- I --> J["VoiceCoordinator<br/>resume continuous listening"]
- J --> A
+ A["Selected microphone device id<br/>or system default mic"] --> B["VoiceCaptureService<br/>AudioGraph input node"]
+ B --> C["PCM frame stream<br/>fixed chunk duration"]
+ C --> D["Ring buffer<br/>bounded pre-roll"]
+ C --> E["VoiceActivityDetector"]
+ C --> F["VoiceWake engine<br/>later"]
+ C --> G["SpeechToText adapter"]
+ E --> H["UtteranceAssembler<br/>for non-streaming STT adapters"]
+ D --> H
+ H --> G
+ G --> I["Transcript events<br/>interim + final"]
+ I --> J["VoiceService / runtime controller"]
+ J --> K["OpenClawGatewayClient<br/>chat.send + reply events"]
```
-## Processing Stages and Data Types
+### Proposed Seams
-| Stage | Component | Input | Output |
-|---|---|---|---|
-| 1 | `SpeechRecognizer` | Windows microphone capture | recognized transcript text |
-| 2a | `VoiceWakeService` | PCM16 chunk | wake score / trigger decision |
-| 2b | `VoiceActivityDetector` | PCM16 chunk | speech/silence state |
-| 3 | `Ring Buffer` | PCM16 chunk stream | bounded pre-roll PCM16 window |
-| 4 | `UtteranceAssembler` | pre-roll + live PCM16 chunks | utterance PCM16 buffer |
-| 5 | `SpeechRecognizer` | utterance PCM16 + timing metadata | transcript text |
-| 6 | `VoiceTransport` | transcript text + session key | `chat.send` request / assistant reply text |
-| 7 | `SpeechSynthesizer + MediaPlayer` | assistant reply text | speaker render stream |
+The target split should look like this:
+
+- `VoiceCaptureService`
+ - owns `AudioGraph`
+ - binds to an explicit input device id when one is selected
+ - emits continuous PCM frames
+- `IVoiceActivityDetector`
+ - emits speech / silence transitions from frame data
+- `IUtteranceAssembler`
+ - builds bounded utterances from frames for non-streaming STT backends
+- `ISpeechToTextAdapter`
+ - consumes either live frames or completed utterances
+ - emits interim and final transcript events
+- `VoiceService`
+ - remains the runtime orchestrator rather than the owner of low-level capture
+
+### Proposed STT Adapter Contract
+
+The STT conversion layer should no longer be "whatever `SpeechRecognizer` does internally". It should become an adapter boundary.
+
+Suggested shape:
+
+- `StartAsync(SpeechToTextStartArgs)`
+- `StopAsync()`
+- `PushFramesAsync(ReadOnlyMemory<byte> pcm16, FrameMetadata metadata)` for streaming-capable adapters
+- `SubmitUtteranceAsync(ReadOnlyMemory<byte> pcm16, UtteranceMetadata metadata)` for utterance-based adapters
+- events:
+ - `InterimTranscriptReceived`
+ - `FinalTranscriptReceived`
+ - `RecognitionFaulted`
+
+Likely first adapters:
+
+- `WindowsSpeechToTextAdapter`
+ only if Windows gives us a clean explicit-audio-input path
+- `StreamingCloudSpeechToTextAdapter`
+ for providers that accept pushed PCM/audio streams
+- `UtteranceCloudSpeechToTextAdapter`
+ for providers that still expect bounded utterance uploads
+
+## Selected-Device Roadmap
+
+The current selected-device position is now:
+
+- selected non-default speaker: implemented
+- selected non-default microphone: not implemented yet
+
+Recommended engineering order:
+
+1. keep the current selected-speaker playback support
+2. introduce `VoiceCaptureService` on `AudioGraph`
+3. move Talk Mode input from `SpeechRecognizer` ownership to captured PCM frames
+4. introduce `ISpeechToTextAdapter`
+5. add explicit selected-microphone binding
+6. then revisit duplex/barge-in and streaming STT
## Control Flow
@@ -612,7 +693,7 @@ flowchart TD
sequenceDiagram
participant Gateway as Gateway / Operator
participant VoiceCap as VoiceCapability
- participant Coord as VoiceCoordinator
+ participant Coord as VoiceService
participant Store as SettingsData.Voice
Gateway->>VoiceCap: voice.settings.get
@@ -622,14 +703,29 @@ sequenceDiagram
VoiceCap->>Store: save VoiceSettings
VoiceCap-->>Gateway: VoiceSettings
-Gateway->>VoiceCap: voice.start(mode=VoiceWake, sessionKey=...)
+Gateway->>VoiceCap: voice.start(mode=TalkMode, sessionKey=...)
VoiceCap->>Coord: Start(VoiceStartArgs)
-Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningForVoiceWake)
+Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningContinuously)
VoiceCap-->>Gateway: VoiceStatusInfo
Gateway->>VoiceCap: voice.status.get
VoiceCap-->>Gateway: VoiceStatusInfo
+ Gateway->>VoiceCap: voice.pause(reason=...)
+ VoiceCap->>Coord: Pause()
+ Coord-->>VoiceCap: VoiceStatusInfo(state=Paused)
+ VoiceCap-->>Gateway: VoiceStatusInfo
+
+ Gateway->>VoiceCap: voice.resume(reason=...)
+ VoiceCap->>Coord: Resume()
+ Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningContinuously)
+ VoiceCap-->>Gateway: VoiceStatusInfo
+
+ Gateway->>VoiceCap: voice.skip(reason=...)
+ VoiceCap->>Coord: SkipCurrentReply()
+ Coord-->>VoiceCap: VoiceStatusInfo
+ VoiceCap-->>Gateway: VoiceStatusInfo
+
Gateway->>VoiceCap: voice.stop(reason=...)
VoiceCap->>Coord: Stop()
Coord-->>VoiceCap: VoiceStatusInfo(state=Stopped)
@@ -650,10 +746,10 @@ Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningForVoiceWake)
### New Components Expected Later
- `VoiceCapability` in `OpenClaw.Shared.Capabilities`
-- `AudioCaptureService` in `OpenClaw.Tray.WinUI.Services`
+- `VoiceCaptureService` in `OpenClaw.Tray.WinUI.Services`
- `VoiceWakeService` in `OpenClaw.Tray.WinUI.Services`
-- `VoiceCoordinator` in `OpenClaw.Tray.WinUI.Services`
-- `AudioPlaybackService` in `OpenClaw.Tray.WinUI.Services`
+- `VoiceChatCoordinator` in `OpenClaw.Tray.WinUI.Services`
+- `VoicePlaybackService` in `OpenClaw.Tray.WinUI.Services`
## Provider Direction
@@ -675,14 +771,12 @@ This keeps the provider surface narrow while still meeting the required MiniMax/
### Story: Support non-local (or non-Windows, local) STT providers
-Allow the user to select a non-local STT provider like OpenAI Whisper, or a local non-Windows
+Allow the user to select a non-local STT provider like OpenAI Whisper, or a local non-Windows recognizer, instead of being locked to the Windows built-in path.
- Windows built-in local STT is working pretty well, however users should have the choice to utilise:
- a non-local STT provider
- a local non-Windows STT provider
-We're all about the choices, Baby!
-
### Story: Full-duplex / barge-in Talk Mode
@@ -779,3 +873,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Added live default-microphone change handling for Talk Mode, so using the system default capture device now refreshes the recognizer when Windows switches to a new default mic such as AirPods.
- `2026-03-25` Applied `Voice.OutputDeviceId` to Talk Mode playback via `MediaPlayer.AudioDevice`, so selected non-default speaker devices now work even though explicit non-default microphone capture is still pending.
- `2026-03-25` Hardened the gateway preview fallback so a bare final chat event does not replay the previous assistant reply when `sessions.preview` lags behind the real session update.
+- `2026-03-25` Updated the voice-mode architecture document with an accurate current Talk Mode flow, the planned `AudioGraph` input design, the STT adapter seam, and the selected-device roadmap.
From 7046346e4ebc200cb45c608edd1829f42a271fd4 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 20:39:58 +0000
Subject: [PATCH 48/83] Add AudioGraph capture backbone for talk mode
---
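Reviewer note (not part of the commit): the capture backbone added here emits continuous frames, and the "Target Input Stack" diagram in `docs/VOICE-MODE.md` plans a bounded pre-roll ring buffer fed from that frame stream. The sketch below is one minimal way such a buffer could look. `PreRollRingBuffer` and its members are hypothetical names, not code from `VoiceCaptureService`. The sketch is not thread-safe, so a capture callback would need its own synchronization.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical bounded pre-roll buffer: keeps only the most recent N PCM frames.
public sealed class PreRollRingBuffer
{
    private readonly byte[][] _frames;
    private int _next;  // slot the next frame will be written to
    private int _count; // number of valid frames, capped at capacity

    public PreRollRingBuffer(int capacityFrames)
    {
        if (capacityFrames <= 0)
            throw new ArgumentOutOfRangeException(nameof(capacityFrames));
        _frames = new byte[capacityFrames][];
    }

    public int Count => _count;

    // Copies the frame so the caller can reuse its buffer.
    // When the buffer is full, the oldest frame is overwritten.
    public void Push(ReadOnlySpan<byte> pcm16Frame)
    {
        _frames[_next] = pcm16Frame.ToArray();
        _next = (_next + 1) % _frames.Length;
        if (_count < _frames.Length) _count++;
    }

    // Returns the buffered frames oldest-first, for example to seed an
    // utterance with audio captured just before speech was detected.
    public List<byte[]> Snapshot()
    {
        var result = new List<byte[]>(_count);
        int start = (_next - _count + _frames.Length) % _frames.Length;
        for (int i = 0; i < _count; i++)
            result.Add(_frames[(start + i) % _frames.Length]);
        return result;
    }
}
```

With a fixed chunk duration, the capacity translates directly into pre-roll time: for example, 25 frames of 20 ms each would retain the last 500 ms of audio. Those numbers are illustrative only.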
docs/VOICE-MODE.md | 91 ++--
.../Services/Voice/VoiceCaptureService.cs | 419 ++++++++++++++++++
.../Services/Voice/VoiceService.cs | 95 +++-
.../VoiceServiceTransportTests.cs | 57 ++-
4 files changed, 609 insertions(+), 53 deletions(-)
create mode 100644 src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 42e4834..eddbfe5 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -535,9 +535,10 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
At runtime today:
- `Voice.OutputDeviceId` is applied to Talk Mode playback through `MediaPlayer.AudioDevice`
-- `Voice.InputDeviceId` is still persisted and shown in the UI, but Talk Mode capture still uses the Windows default speech input path
+- `VoiceCaptureService` now runs an `AudioGraph` capture pipeline in parallel with Talk Mode and binds it to the selected or default microphone device
+- `Voice.InputDeviceId` is now used by that `AudioGraph` capture path, but transcript generation still uses the Windows default speech input path until the STT adapter migration is complete
- when the system default capture device changes and Talk Mode is using the default mic, the recognizer is rebuilt so device switches such as AirPods are picked up without a full app restart
-- explicit non-default microphone binding is still pending the planned `AudioGraph` capture refactor
+- explicit non-default microphone transcript generation is still pending the planned STT adapter migration
## Current Runtime Architecture
@@ -545,8 +546,10 @@ The current Windows implementation is still centred on `VoiceService`, with a fe
- `VoiceCapability`
exposes shared `voice.*` commands to the node/gateway surface
+- `VoiceCaptureService`
+ owns the new `AudioGraph` capture backbone, selected/default microphone binding, and live signal detection
- `VoiceService`
- owns Talk Mode runtime state, Windows STT/TTS integration, reply queuing, timeouts, and gateway reply handling
+ owns Talk Mode runtime state, recognizer/TTS integration, reply queuing, timeouts, gateway reply handling, and the transition layer between `AudioGraph` capture and the current recognizer-owned STT path
- `VoiceChatCoordinator`
mirrors interim transcript drafts and conversation turns into the tray UI without making the chat window part of the transport path
- `OpenClawGatewayClient`
@@ -558,46 +561,52 @@ The current Windows implementation is still centred on `VoiceService`, with a fe
```mermaid
flowchart LR
- A["User speech"] --> B["Windows SpeechRecognizer<br/>continuous dictation on current default mic"]
- B --> C["HypothesisGenerated<br/>interim text"]
- C --> D["VoiceService<br/>draft event"]
- D --> E["VoiceChatCoordinator"]
- E --> F["WebChatWindow<br/>local compose-box mirror only"]
-
- B --> G["ResultGenerated<br/>final Medium/High text"]
- G --> H["VoiceService<br/>duplicate guard + late hypothesis promotion"]
- H --> I["Stop recognition session"]
- I --> J["OpenClawGatewayClient.SendChatMessageAsync<br/>direct chat.send(main, transcript)"]
- J --> K["OpenClaw / session pipeline"]
- K --> L["Chat final event"]
- L --> M{"assistant text present?"}
- M -- "yes" --> N["assistant text"]
- M -- "no" --> O["sessions.preview fallback<br/>with stale-preview retry guard"]
- O --> N
-
- N --> P["VoiceService reply queue"]
- P --> Q{"TTS provider"}
- Q -- "windows" --> R["SpeechSynthesizer"]
- Q -- "cloud" --> S["VoiceCloudTextToSpeechClient<br/>MiniMax websocket or other provider"]
- R --> T["Complete playable stream"]
- S --> T
- T --> U["MediaPlayer<br/>selected OutputDeviceId if set"]
- U --> V["Speaker / headset output"]
- V --> W["Resume recognition when queue drains"]
+ A["User speech"] --> B["VoiceCaptureService<br/>AudioGraph on selected/default mic"]
+ A --> C["Windows SpeechRecognizer<br/>continuous dictation on current default mic"]
+
+ B --> D["FrameCaptured / SignalDetected"]
+ D --> E["VoiceService<br/>capture-backed health + device state"]
+
+ C --> F["HypothesisGenerated<br/>interim text"]
+ F --> G["VoiceService<br/>draft event"]
+ G --> H["VoiceChatCoordinator"]
+ H --> I["WebChatWindow<br/>local compose-box mirror only"]
+
+ C --> J["ResultGenerated<br/>final Medium/High text"]
+ J --> K["VoiceService<br/>duplicate guard + late hypothesis promotion"]
+ K --> L["Stop recognition session"]
+ L --> M["OpenClawGatewayClient.SendChatMessageAsync<br/>direct chat.send(main, transcript)"]
+ M --> N["OpenClaw / session pipeline"]
+ N --> O["Chat final event"]
+ O --> P{"assistant text present?"}
+ P -- "yes" --> Q["assistant text"]
+ P -- "no" --> R["sessions.preview fallback<br/>with stale-preview retry guard"]
+ R --> Q
+
+ Q --> S["VoiceService reply queue"]
+ S --> T{"TTS provider"}
+ T -- "windows" --> U["SpeechSynthesizer"]
+ T -- "cloud" --> V["VoiceCloudTextToSpeechClient<br/>MiniMax websocket or other provider"]
+ U --> W["Complete playable stream"]
+ V --> W
+ W --> X["MediaPlayer<br/>selected OutputDeviceId if set"]
+ X --> Y["Speaker / headset output"]
+ Y --> Z["Resume recognition when queue drains"]
```
### Current Processing Stages
| Stage | Component | Input | Output |
|---|---|---|---|
-| 1 | `SpeechRecognizer` | Windows default speech-input path | interim/final transcript text |
-| 2 | `VoiceService` | final transcript text | de-duplicated transcript + runtime state changes |
-| 3 | `VoiceChatCoordinator` | interim/final draft events | mirrored tray chat compose text |
-| 4 | `OpenClawGatewayClient` | transcript text + session key | `chat.send` request + assistant reply events |
-| 5 | `OpenClawGatewayClient` preview fallback | bare final chat marker | assistant preview text, guarded against stale replay |
-| 6 | `VoiceService` reply queue | assistant reply text | ordered reply playback work |
-| 7 | `VoiceCloudTextToSpeechClient` / `SpeechSynthesizer` | assistant reply text | complete playable audio stream |
-| 8 | `MediaPlayer` | complete playable audio stream | rendered audio on default or selected speaker |
+| 1 | `VoiceCaptureService` | selected/default microphone device | continuous frame and signal events from `AudioGraph` |
+| 2 | `SpeechRecognizer` | Windows default speech-input path | interim/final transcript text |
+| 3 | `VoiceService` | capture signal + final transcript text | health/restart decisions, de-duplicated transcript, runtime state changes |
+| 4 | `VoiceChatCoordinator` | interim/final draft events | mirrored tray chat compose text |
+| 5 | `OpenClawGatewayClient` | transcript text + session key | `chat.send` request + assistant reply events |
+| 6 | `OpenClawGatewayClient` preview fallback | bare final chat marker | assistant preview text, guarded against stale replay |
+| 7 | `VoiceService` reply queue | assistant reply text | ordered reply playback work |
+| 8 | `VoiceCloudTextToSpeechClient` / `SpeechSynthesizer` | assistant reply text | complete playable audio stream |
+| 9 | `MediaPlayer` | complete playable audio stream | rendered audio on default or selected speaker |
## Planned AudioGraph Input Architecture
@@ -676,15 +685,16 @@ Likely first adapters:
The current selected-device position is now:
- selected non-default speaker: implemented
-- selected non-default microphone: not implemented yet
+- selected/default microphone binding for `AudioGraph` capture: implemented
+- selected non-default microphone for actual transcript generation: not implemented yet
Recommended engineering order:
1. keep the current selected-speaker playback support
-2. introduce `VoiceCaptureService` on `AudioGraph`
+2. extend the live `VoiceCaptureService` path into the STT side
3. move Talk Mode input from `SpeechRecognizer` ownership to captured PCM frames
4. introduce `ISpeechToTextAdapter`
-5. add explicit selected-microphone binding
+5. complete explicit selected-microphone transcript generation
6. then revisit duplex/barge-in and streaming STT
## Control Flow
@@ -874,3 +884,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Applied `Voice.OutputDeviceId` to Talk Mode playback via `MediaPlayer.AudioDevice`, so selected non-default speaker devices now work even though explicit non-default microphone capture is still pending.
- `2026-03-25` Hardened the gateway preview fallback so a bare final chat event does not replay the previous assistant reply when `sessions.preview` lags behind the real session update.
- `2026-03-25` Updated the voice-mode architecture document with an accurate current Talk Mode flow, the planned `AudioGraph` input design, the STT adapter seam, and the selected-device roadmap.
+- `2026-03-25` Added `VoiceCaptureService` on `AudioGraph`, wired it into Talk Mode lifecycle, and started using live capture signal as part of recognizer health and device-refresh handling while transcript generation still remains on the Windows speech recognizer.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs
new file mode 100644
index 0000000..b0ebc57
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs
@@ -0,0 +1,419 @@
+using System;
+using System.Linq;
+using System.Runtime.InteropServices;
+using System.Threading;
+using System.Threading.Tasks;
+using OpenClaw.Shared;
+using Windows.Devices.Enumeration;
+using Windows.Media;
+using Windows.Media.Audio;
+using Windows.Media.Capture;
+using Windows.Media.Devices;
+using Windows.Media.Render;
+
+namespace OpenClawTray.Services.Voice;
+
+public sealed class VoiceAudioFrameEventArgs : EventArgs
+{
+ public VoiceAudioFrameEventArgs(
+ string? deviceId,
+ string? deviceName,
+ DateTime utcTimestamp,
+ int sampleRateHz,
+ int channelCount,
+ byte[] data,
+ float peakLevel)
+ {
+ DeviceId = deviceId;
+ DeviceName = deviceName;
+ UtcTimestamp = utcTimestamp;
+ SampleRateHz = sampleRateHz;
+ ChannelCount = channelCount;
+ Data = data;
+ PeakLevel = peakLevel;
+ }
+
+ public string? DeviceId { get; }
+ public string? DeviceName { get; }
+ public DateTime UtcTimestamp { get; }
+ public int SampleRateHz { get; }
+ public int ChannelCount { get; }
+ public byte[] Data { get; }
+ public float PeakLevel { get; }
+}
+
+public sealed class VoiceCaptureSignalEventArgs : EventArgs
+{
+ public VoiceCaptureSignalEventArgs(
+ string? deviceId,
+ string? deviceName,
+ DateTime utcTimestamp,
+ float peakLevel)
+ {
+ DeviceId = deviceId;
+ DeviceName = deviceName;
+ UtcTimestamp = utcTimestamp;
+ PeakLevel = peakLevel;
+ }
+
+ public string? DeviceId { get; }
+ public string? DeviceName { get; }
+ public DateTime UtcTimestamp { get; }
+ public float PeakLevel { get; }
+}
+
+public sealed class VoiceCaptureService : IAsyncDisposable
+{
+ private const float DefaultSignalThreshold = 0.015f;
+
+ private readonly IOpenClawLogger _logger;
+ private readonly object _gate = new();
+
+ private AudioGraph? _audioGraph;
+ private AudioDeviceInputNode? _deviceInputNode;
+ private AudioFrameOutputNode? _frameOutputNode;
+ private DeviceInformation? _activeCaptureDevice;
+ private int _sampleRateHz;
+ private int _channelCount;
+
+ public VoiceCaptureService(IOpenClawLogger logger)
+ {
+ _logger = logger;
+ }
+
+ public event EventHandler<VoiceAudioFrameEventArgs>? FrameCaptured;
+ public event EventHandler<VoiceCaptureSignalEventArgs>? SignalDetected;
+
+ public bool IsRunning
+ {
+ get
+ {
+ lock (_gate)
+ {
+ return _audioGraph != null;
+ }
+ }
+ }
+
+ public string? ActiveDeviceId
+ {
+ get
+ {
+ lock (_gate)
+ {
+ return _activeCaptureDevice?.Id;
+ }
+ }
+ }
+
+ public string? ActiveDeviceName
+ {
+ get
+ {
+ lock (_gate)
+ {
+ return _activeCaptureDevice?.Name;
+ }
+ }
+ }
+
+ public async Task StartAsync(VoiceSettings settings, CancellationToken cancellationToken)
+ {
+ ArgumentNullException.ThrowIfNull(settings);
+
+ await StopAsync();
+ cancellationToken.ThrowIfCancellationRequested();
+
+ AudioGraph? audioGraph = null;
+ AudioDeviceInputNode? deviceInputNode = null;
+ AudioFrameOutputNode? frameOutputNode = null;
+
+ try
+ {
+ var graphSettings = new AudioGraphSettings(AudioRenderCategory.Speech)
+ {
+ QuantumSizeSelectionMode = QuantumSizeSelectionMode.ClosestToDesired,
+ DesiredSamplesPerQuantum = (int)ResolveDesiredSamplesPerQuantum(settings.SampleRateHz, settings.CaptureChunkMs)
+ };
+
+ var graphCreation = await AudioGraph.CreateAsync(graphSettings);
+ if (graphCreation.Status != AudioGraphCreationStatus.Success || graphCreation.Graph == null)
+ {
+ throw new InvalidOperationException($"AudioGraph unavailable: {graphCreation.Status}");
+ }
+
+ audioGraph = graphCreation.Graph;
+ var captureDevice = await ResolveCaptureDeviceAsync(settings.InputDeviceId);
+ var inputCreation = await audioGraph.CreateDeviceInputNodeAsync(
+ MediaCategory.Speech,
+ audioGraph.EncodingProperties,
+ captureDevice);
+
+ if (inputCreation.Status != AudioDeviceNodeCreationStatus.Success || inputCreation.DeviceInputNode == null)
+ {
+ throw new InvalidOperationException($"Audio input node unavailable: {inputCreation.Status}");
+ }
+
+ deviceInputNode = inputCreation.DeviceInputNode;
+ frameOutputNode = audioGraph.CreateFrameOutputNode(audioGraph.EncodingProperties);
+ deviceInputNode.AddOutgoingConnection(frameOutputNode);
+
+ audioGraph.QuantumStarted += OnAudioGraphQuantumStarted;
+ audioGraph.UnrecoverableErrorOccurred += OnAudioGraphUnrecoverableErrorOccurred;
+
+ lock (_gate)
+ {
+ _audioGraph = audioGraph;
+ _deviceInputNode = deviceInputNode;
+ _frameOutputNode = frameOutputNode;
+ _activeCaptureDevice = captureDevice;
+ _sampleRateHz = (int)audioGraph.EncodingProperties.SampleRate;
+ _channelCount = (int)audioGraph.EncodingProperties.ChannelCount;
+ }
+
+ frameOutputNode.Start();
+ deviceInputNode.Start();
+ audioGraph.Start();
+
+ audioGraph = null;
+ deviceInputNode = null;
+ frameOutputNode = null;
+
+ _logger.Info(
+ $"Voice capture graph started on {(captureDevice?.Name ?? "system default microphone")} ({captureDevice?.Id ?? "default"})");
+ }
+ finally
+ {
+ if (frameOutputNode != null)
+ {
+ try { frameOutputNode.Stop(); } catch { }
+ try { frameOutputNode.Dispose(); } catch { }
+ }
+
+ if (deviceInputNode != null)
+ {
+ try { deviceInputNode.Stop(); } catch { }
+ try { deviceInputNode.Dispose(); } catch { }
+ }
+
+ if (audioGraph != null)
+ {
+ audioGraph.QuantumStarted -= OnAudioGraphQuantumStarted;
+ audioGraph.UnrecoverableErrorOccurred -= OnAudioGraphUnrecoverableErrorOccurred;
+ try { audioGraph.Stop(); } catch { }
+ try { audioGraph.Dispose(); } catch { }
+ }
+ }
+ }
+
+ public ValueTask DisposeAsync()
+ {
+ return new ValueTask(StopAsync());
+ }
+
+ public async Task StopAsync()
+ {
+ AudioGraph? audioGraph;
+ AudioDeviceInputNode? deviceInputNode;
+ AudioFrameOutputNode? frameOutputNode;
+ string? deviceName;
+
+ lock (_gate)
+ {
+ audioGraph = _audioGraph;
+ _audioGraph = null;
+ deviceInputNode = _deviceInputNode;
+ _deviceInputNode = null;
+ frameOutputNode = _frameOutputNode;
+ _frameOutputNode = null;
+ deviceName = _activeCaptureDevice?.Name;
+ _activeCaptureDevice = null;
+ _sampleRateHz = 0;
+ _channelCount = 0;
+ }
+
+ if (audioGraph == null && deviceInputNode == null && frameOutputNode == null)
+ {
+ return;
+ }
+
+ if (audioGraph != null)
+ {
+ audioGraph.QuantumStarted -= OnAudioGraphQuantumStarted;
+ audioGraph.UnrecoverableErrorOccurred -= OnAudioGraphUnrecoverableErrorOccurred;
+ }
+
+ try { frameOutputNode?.Stop(); } catch { }
+ try { deviceInputNode?.Stop(); } catch { }
+ try { audioGraph?.Stop(); } catch { }
+
+ try { frameOutputNode?.Dispose(); } catch { }
+ try { deviceInputNode?.Dispose(); } catch { }
+ try { audioGraph?.Dispose(); } catch { }
+
+ await Task.CompletedTask;
+ _logger.Info($"Voice capture graph stopped{(string.IsNullOrWhiteSpace(deviceName) ? string.Empty : $" ({deviceName})")}");
+ }
+
+ internal static uint ResolveDesiredSamplesPerQuantum(int sampleRateHz, int chunkMs)
+ {
+ if (sampleRateHz <= 0)
+ {
+ sampleRateHz = 16000;
+ }
+
+ if (chunkMs <= 0)
+ {
+ chunkMs = 80;
+ }
+
+ var desired = (sampleRateHz * chunkMs) / 1000;
+ return (uint)Math.Max(desired, 128);
+ }
+
+ internal static bool HasAudibleSignal(float peakLevel, float threshold = DefaultSignalThreshold)
+ {
+ return peakLevel >= threshold;
+ }
+
+ internal static float ComputePeakLevel(byte[] data)
+ {
+ if (data.Length < sizeof(float))
+ {
+ return 0f;
+ }
+
+ float peak = 0f;
+ var alignedLength = data.Length - (data.Length % sizeof(float));
+ for (var offset = 0; offset < alignedLength; offset += sizeof(float))
+ {
+ var sample = Math.Abs(BitConverter.ToSingle(data, offset));
+ if (sample > peak)
+ {
+ peak = sample;
+ }
+ }
+
+ return float.IsFinite(peak) ? peak : 0f;
+ }
+
+ private async Task<DeviceInformation> ResolveCaptureDeviceAsync(string? preferredInputDeviceId)
+ {
+ var devices = await DeviceInformation.FindAllAsync(DeviceClass.AudioCapture);
+ if (devices.Count == 0)
+ {
+ throw new InvalidOperationException("No audio capture devices are available.");
+ }
+
+ if (!string.IsNullOrWhiteSpace(preferredInputDeviceId))
+ {
+ var selected = devices.FirstOrDefault(device =>
+ string.Equals(device.Id, preferredInputDeviceId, StringComparison.Ordinal));
+
+ if (selected != null)
+ {
+ return selected;
+ }
+
+ throw new InvalidOperationException($"Selected input device '{preferredInputDeviceId}' was not found.");
+ }
+
+ var defaultId = MediaDevice.GetDefaultAudioCaptureId(AudioDeviceRole.Default);
+ var defaultDevice = devices.FirstOrDefault(device =>
+ string.Equals(device.Id, defaultId, StringComparison.Ordinal));
+
+ return defaultDevice ?? devices[0];
+ }
+
+ private void OnAudioGraphUnrecoverableErrorOccurred(AudioGraph sender, AudioGraphUnrecoverableErrorOccurredEventArgs args)
+ {
+ _logger.Warn($"Voice capture graph unrecoverable error: {args.Error}");
+ }
+
+ private void OnAudioGraphQuantumStarted(AudioGraph sender, object args)
+ {
+ try
+ {
+ AudioFrameOutputNode? frameOutputNode;
+ string? deviceId;
+ string? deviceName;
+ int sampleRateHz;
+ int channelCount;
+
+ lock (_gate)
+ {
+ frameOutputNode = _frameOutputNode;
+ deviceId = _activeCaptureDevice?.Id;
+ deviceName = _activeCaptureDevice?.Name;
+ sampleRateHz = _sampleRateHz;
+ channelCount = _channelCount;
+ }
+
+ if (frameOutputNode == null)
+ {
+ return;
+ }
+
+ using var frame = frameOutputNode.GetFrame();
+ if (!TryCopyAudioFrame(frame, out var bytes) || bytes.Length == 0)
+ {
+ return;
+ }
+
+ var utcNow = DateTime.UtcNow;
+ var peak = ComputePeakLevel(bytes);
+ FrameCaptured?.Invoke(
+ this,
+ new VoiceAudioFrameEventArgs(
+ deviceId,
+ deviceName,
+ utcNow,
+ sampleRateHz,
+ channelCount,
+ bytes,
+ peak));
+
+ if (HasAudibleSignal(peak))
+ {
+ SignalDetected?.Invoke(
+ this,
+ new VoiceCaptureSignalEventArgs(
+ deviceId,
+ deviceName,
+ utcNow,
+ peak));
+ }
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice capture quantum processing failed: {ex.Message}");
+ }
+ }
+
+ private static bool TryCopyAudioFrame(AudioFrame frame, out byte[] bytes)
+ {
+ bytes = [];
+
+ using var buffer = frame.LockBuffer(AudioBufferAccessMode.Read);
+ using var reference = buffer.CreateReference();
+ var access = (IMemoryBufferByteAccess)reference;
+ access.GetBuffer(out var data, out var capacity);
+
+ if (data == IntPtr.Zero || capacity == 0)
+ {
+ return false;
+ }
+
+ bytes = new byte[capacity];
+ Marshal.Copy(data, bytes, 0, (int)capacity);
+ return true;
+ }
+
+ [ComImport]
+ [Guid("5B0D3235-4DBA-4D44-865E-8F1D0E4FD04D")]
+ [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
+ private interface IMemoryBufferByteAccess
+ {
+ void GetBuffer(out IntPtr buffer, out uint capacity);
+ }
+}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 5d107b2..b58e9f3 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -47,12 +47,14 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private OpenClawGatewayClient? _chatClient;
private ConnectionStatus _chatTransportStatus = ConnectionStatus.Disconnected;
private TaskCompletionSource<bool>? _transportReadyTcs;
+ private VoiceCaptureService? _voiceCaptureService;
private SpeechRecognizer? _speechRecognizer;
private SpeechSynthesizer? _speechSynthesizer;
private MediaPlayer? _mediaPlayer;
private bool _recognitionActive;
private int _recognitionSessionGeneration;
private bool _recognitionSessionHadActivity;
+ private bool _recognitionSessionHadCaptureSignal;
private bool _recognitionHealthCheckArmed;
private bool _recognitionRestartInProgress;
private bool _awaitingReply;
@@ -507,6 +509,7 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
await EnsureMicrophoneConsentAsync();
CancellationTokenSource? runtimeCts = null;
+ VoiceCaptureService? captureService = null;
SpeechRecognizer? recognizer = null;
SpeechSynthesizer? synthesizer = null;
MediaPlayer? player = null;
@@ -514,6 +517,9 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
try
{
runtimeCts = new CancellationTokenSource();
+ captureService = new VoiceCaptureService(_logger);
+ captureService.SignalDetected += OnCaptureSignalDetected;
+ await captureService.StartAsync(settings, runtimeCts.Token);
recognizer = await CreateSpeechRecognizerAsync(settings);
synthesizer = new SpeechSynthesizer();
player = new MediaPlayer();
@@ -521,7 +527,8 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
if (!string.IsNullOrWhiteSpace(settings.InputDeviceId))
{
- _logger.Warn("Selected input device is saved, but Talk Mode currently uses the system speech input device.");
+ _logger.Warn(
+ "AudioGraph capture is bound to the selected input device, but Windows STT transcription still follows the system speech input path until the STT adapter migration is complete.");
}
recognizer.HypothesisGenerated += OnSpeechHypothesisGenerated;
@@ -531,6 +538,7 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
lock (_gate)
{
_runtimeCts = runtimeCts;
+ _voiceCaptureService = captureService;
_speechRecognizer = recognizer;
_speechSynthesizer = synthesizer;
_mediaPlayer = player;
@@ -575,6 +583,13 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
}
else
{
+ if (captureService != null)
+ {
+ captureService.SignalDetected -= OnCaptureSignalDetected;
+ try { await captureService.StopAsync(); } catch { }
+ try { await captureService.DisposeAsync(); } catch { }
+ }
+
if (recognizer != null)
{
try { recognizer.HypothesisGenerated -= OnSpeechHypothesisGenerated; } catch { }
@@ -742,6 +757,7 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
runtimeToken = _runtimeCts.Token;
generation = ++_recognitionSessionGeneration;
+ _recognitionSessionHadCaptureSignal = false;
}
_logger.Info("Starting speech recognition session");
@@ -842,6 +858,7 @@ private async Task StopRecognitionSessionAsync()
_recognitionActive = false;
_recognitionHealthCheckArmed = false;
+ _recognitionSessionHadCaptureSignal = false;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
}
@@ -1008,6 +1025,7 @@ private async Task HandleRecognizedTextAsync(string text)
_lastTranscriptUtc = DateTime.UtcNow;
_recognitionSessionHadActivity = true;
_recognitionHealthCheckArmed = false;
+ _recognitionSessionHadCaptureSignal = false;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
cancellationToken = _runtimeCts.Token;
@@ -1439,6 +1457,7 @@ internal static bool ShouldRestartRecognitionAfterCompletion(
internal static bool ShouldRebuildRecognitionAfterCompletion(
SpeechRecognitionResultStatus status,
bool sessionHadActivity,
+ bool sessionHadCaptureSignal,
bool restartInProgress,
bool awaitingReply,
bool isSpeaking)
@@ -1448,7 +1467,8 @@ internal static bool ShouldRebuildRecognitionAfterCompletion(
return false;
}
- return status == SpeechRecognitionResultStatus.UserCanceled ||
+ return sessionHadCaptureSignal ||
+ status == SpeechRecognitionResultStatus.UserCanceled ||
status == SpeechRecognitionResultStatus.TimeoutExceeded;
}
@@ -1557,6 +1577,7 @@ private async void OnSpeechRecognitionCompleted(
var shouldRebuildRecognizer = false;
var restartInProgress = false;
var sessionHadActivity = false;
+ var sessionHadCaptureSignal = false;
lock (_gate)
{
@@ -1567,7 +1588,9 @@ private async void OnSpeechRecognitionCompleted(
_recognitionActive = false;
sessionHadActivity = _recognitionSessionHadActivity;
+ sessionHadCaptureSignal = _recognitionSessionHadCaptureSignal;
_recognitionSessionHadActivity = false;
+ _recognitionSessionHadCaptureSignal = false;
restartInProgress = _recognitionRestartInProgress;
if (restartInProgress)
{
@@ -1590,13 +1613,14 @@ private async void OnSpeechRecognitionCompleted(
shouldRebuildRecognizer = ShouldRebuildRecognitionAfterCompletion(
args.Status,
sessionHadActivity,
+ sessionHadCaptureSignal,
restartInProgress,
_awaitingReply,
_isSpeaking);
}
_logger.Warn(
- $"Speech recognition session completed with status {args.Status}; restart={shouldRestart}; rebuild={shouldRebuildRecognizer}; hadActivity={sessionHadActivity}");
+ $"Speech recognition session completed with status {args.Status}; restart={shouldRestart}; rebuild={shouldRebuildRecognizer}; hadActivity={sessionHadActivity}; hadCaptureSignal={sessionHadCaptureSignal}");
if (shouldRestart && !token.IsCancellationRequested)
{
@@ -1671,6 +1695,7 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
CancellationTokenSource? runtimeCts;
CancellationTokenSource? playbackSkipCts;
OpenClawGatewayClient? chatClient;
+ VoiceCaptureService? captureService;
SpeechRecognizer? recognizer;
SpeechSynthesizer? synthesizer;
MediaPlayer? player;
@@ -1686,10 +1711,14 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_chatTransportStatus = ConnectionStatus.Disconnected;
_transportReadyTcs = null;
+ captureService = _voiceCaptureService;
+ _voiceCaptureService = null;
+
recognizer = _speechRecognizer;
_speechRecognizer = null;
_recognitionActive = false;
_recognitionSessionHadActivity = false;
+ _recognitionSessionHadCaptureSignal = false;
_recognitionRestartInProgress = false;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
@@ -1714,6 +1743,13 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
try { runtimeCts?.Cancel(); } catch { }
try { playbackSkipCts?.Cancel(); } catch { }
+ if (captureService != null)
+ {
+ captureService.SignalDetected -= OnCaptureSignalDetected;
+ try { await captureService.StopAsync(); } catch { }
+ try { await captureService.DisposeAsync(); } catch { }
+ }
+
if (recognizer != null)
{
recognizer.HypothesisGenerated -= OnSpeechHypothesisGenerated;
@@ -1904,6 +1940,7 @@ private async void OnDefaultAudioCaptureDeviceChanged(object sender, DefaultAudi
await StopRecognitionSessionAsync();
}
+ await RebuildVoiceCaptureAsync("default capture device changed", token);
await RebuildSpeechRecognizerAsync("default capture device changed", token);
if (shouldRestartListening && !token.IsCancellationRequested)
@@ -1928,6 +1965,24 @@ private void ArmRecognitionHealthCheck()
}
}
+ private void OnCaptureSignalDetected(object? sender, VoiceCaptureSignalEventArgs args)
+ {
+ lock (_gate)
+ {
+ if (_runtimeCts == null ||
+ !_status.Running ||
+ _status.Mode != VoiceActivationMode.TalkMode ||
+ !_recognitionActive ||
+ _awaitingReply ||
+ _isSpeaking)
+ {
+ return;
+ }
+
+ _recognitionSessionHadCaptureSignal = true;
+ }
+ }
+
private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken cancellationToken)
{
SpeechRecognizer? oldRecognizer;
@@ -1946,6 +2001,7 @@ private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken
_speechRecognizer = null;
_recognitionActive = false;
_recognitionSessionHadActivity = false;
+ _recognitionSessionHadCaptureSignal = false;
_recognitionHealthCheckArmed = false;
}
@@ -1991,6 +2047,33 @@ private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken
}
}
+ private async Task RebuildVoiceCaptureAsync(string reason, CancellationToken cancellationToken)
+ {
+ VoiceCaptureService? captureService;
+ VoiceSettings settings;
+
+ lock (_gate)
+ {
+ if (_runtimeCts == null || _runtimeCts.IsCancellationRequested)
+ {
+ return;
+ }
+
+ captureService = _voiceCaptureService;
+ settings = Clone(_settings.Voice);
+ _recognitionSessionHadCaptureSignal = false;
+ }
+
+ if (captureService == null)
+ {
+ return;
+ }
+
+ cancellationToken.ThrowIfCancellationRequested();
+ await captureService.StartAsync(settings, cancellationToken);
+ _logger.Info($"Voice capture graph rebuilt ({reason})");
+ }
+
private async Task MonitorRecognitionSessionHealthAsync(int generation, CancellationToken cancellationToken)
{
try
@@ -1998,8 +2081,10 @@ private async Task MonitorRecognitionSessionHealthAsync(int generation, Cancella
await Task.Delay(RecognitionHealthCheckDelay, cancellationToken);
var shouldRecycle = false;
+ var sawCaptureSignal = false;
lock (_gate)
{
+ sawCaptureSignal = _recognitionSessionHadCaptureSignal;
shouldRecycle =
_recognitionHealthCheckArmed &&
_recognitionActive &&
@@ -2028,12 +2113,12 @@ private async Task MonitorRecognitionSessionHealthAsync(int generation, Cancella
}
_logger.Warn(
- $"Speech recognition session produced no hypotheses/results within {RecognitionHealthCheckDelay.TotalSeconds:0}s; recycling session");
+ $"Speech recognition session produced no hypotheses/results within {RecognitionHealthCheckDelay.TotalSeconds:0}s; recycling session (captureSignal={sawCaptureSignal})");
await StopRecognitionSessionAsync();
await ResumeRecognitionSessionAsync(
cancellationToken,
"recognition health check",
- rebuildRecognizer: true);
+ rebuildRecognizer: sawCaptureSignal);
}
catch (OperationCanceledException)
{
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index 9052638..0916234 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -152,16 +152,18 @@ public void ShouldRestartRecognitionAfterCompletion_SuppressesControlledRecycle(
}
[Theory]
- [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, false, true)]
- [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, true)]
- [InlineData(SpeechRecognitionResultStatus.Success, false, false, false, false, false)]
- [InlineData(SpeechRecognitionResultStatus.UserCanceled, true, false, false, false, false)]
- [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, true, false, false, false)]
- [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, true, false, false)]
- [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, true, false)]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, false, false, true)]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, false, true)]
+ [InlineData(SpeechRecognitionResultStatus.Success, false, false, false, false, false, false)]
+ [InlineData(SpeechRecognitionResultStatus.Success, false, true, false, false, false, true)]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, true, false, false, false, false, false)]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, true, false, false, false)]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, true, false, false)]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, false, true, false)]
public void ShouldRebuildRecognitionAfterCompletion_OnlyRebuildsForDeafCanceledSessions(
SpeechRecognitionResultStatus status,
bool sessionHadActivity,
+ bool sessionHadCaptureSignal,
bool restartInProgress,
bool awaitingReply,
bool isSpeaking,
@@ -173,11 +175,50 @@ public void ShouldRebuildRecognitionAfterCompletion_OnlyRebuildsForDeafCanceledS
var result = (bool)method.Invoke(
null,
- [status, sessionHadActivity, restartInProgress, awaitingReply, isSpeaking])!;
+ [status, sessionHadActivity, sessionHadCaptureSignal, restartInProgress, awaitingReply, isSpeaking])!;
Assert.Equal(expected, result);
}
+ [Theory]
+ [InlineData(16000, 80, 1280)]
+ [InlineData(16000, 0, 1280)]
+ [InlineData(0, 80, 1280)]
+ [InlineData(48000, 20, 960)]
+ public void ResolveDesiredSamplesPerQuantum_UsesSpeechFriendlyDefaults(
+ int sampleRateHz,
+ int chunkMs,
+ uint expected)
+ {
+ var method = typeof(VoiceCaptureService).GetMethod(
+ "ResolveDesiredSamplesPerQuantum",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (uint)method.Invoke(null, [sampleRateHz, chunkMs])!;
+
+ Assert.Equal(expected, result);
+ }
+
+ public static IEnumerable<object[]> PeakLevelCases()
+ {
+ yield return [new byte[] { 0, 0, 0, 0 }, 0f];
+ yield return [new byte[] { 0, 0, 0, 63 }, 0.5f];
+ yield return [new byte[] { 0, 0, 128, 63, 0, 0, 0, 191 }, 1f];
+ }
+
+ [Theory]
+ [MemberData(nameof(PeakLevelCases))]
+ public void ComputePeakLevel_FindsLargestAbsoluteFloatSample(byte[] data, float expected)
+ {
+ var method = typeof(VoiceCaptureService).GetMethod(
+ "ComputePeakLevel",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (float)method.Invoke(null, [data])!;
+
+ Assert.Equal(expected, result, 3);
+ }
+
[Theory]
[InlineData("Now again testing", "again testing", 1, true, "Now again testing")]
[InlineData("again testing", "again testing", 1, false, "again testing")]
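Note on the capture helpers exercised by the new tests in this patch: the quantum size is `sampleRateHz * chunkMs / 1000` with fallbacks of 16 kHz and 80 ms and a floor of 128 samples, and the peak level is the largest absolute float32 sample in the frame. The following is a minimal stand-alone sketch of that arithmetic, not the patch's code. `CaptureMath`, `SamplesPerQuantum`, and `PeakLevel` are hypothetical names, and `MemoryMarshal.Cast` is swapped in for the `BitConverter.ToSingle` loop. It assumes frames hold float32 samples in native byte order, which the little-endian test vectors imply.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper class for illustration; not part of the patch.
public static class CaptureMath
{
    // Samples per capture chunk: rate (Hz) * chunk length (ms) / 1000,
    // using the same fallbacks (16 kHz, 80 ms) and 128-sample floor as the patch.
    public static uint SamplesPerQuantum(int sampleRateHz, int chunkMs)
    {
        if (sampleRateHz <= 0) sampleRateHz = 16000;
        if (chunkMs <= 0) chunkMs = 80;
        return (uint)Math.Max((sampleRateHz * chunkMs) / 1000, 128);
    }

    // Largest absolute float32 sample in the buffer, or 0 if the result is not finite.
    public static float PeakLevel(ReadOnlySpan<byte> data)
    {
        // Reinterpret the bytes as float32 samples without copying.
        // Any trailing bytes that do not form a whole sample are dropped by Cast.
        ReadOnlySpan<float> samples = MemoryMarshal.Cast<byte, float>(data);

        float peak = 0f;
        foreach (float sample in samples)
        {
            float magnitude = Math.Abs(sample);
            if (magnitude > peak) // NaN never compares greater, so it is skipped
            {
                peak = magnitude;
            }
        }

        return float.IsFinite(peak) ? peak : 0f;
    }
}
```

With these definitions, `CaptureMath.SamplesPerQuantum(48000, 20)` gives 960 and `CaptureMath.PeakLevel(new byte[] { 0, 0, 0, 63 })` gives 0.5, matching the `InlineData` and `PeakLevelCases` rows above.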
From 339a969a8e23a14a029f18dbba227245486f8620 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 20:51:55 +0000
Subject: [PATCH 49/83] Fix AudioGraph frame buffer interop
---
docs/VOICE-MODE.md | 1 +
src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index eddbfe5..4f33c06 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -885,3 +885,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Hardened the gateway preview fallback so a bare final chat event does not replay the previous assistant reply when `sessions.preview` lags behind the real session update.
- `2026-03-25` Updated the voice-mode architecture document with an accurate current Talk Mode flow, the planned `AudioGraph` input design, the STT adapter seam, and the selected-device roadmap.
- `2026-03-25` Added `VoiceCaptureService` on `AudioGraph`, wired it into Talk Mode lifecycle, and started using live capture signal as part of recognizer health and device-refresh handling while transcript generation still remains on the Windows speech recognizer.
+- `2026-03-25` Fixed `VoiceCaptureService` frame-buffer interop for the current WinRT projection so AudioGraph quantum processing can read frame data instead of throwing invalid-cast warnings on every capture quantum.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs
index b0ebc57..12ed036 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs
@@ -4,6 +4,7 @@
using System.Threading;
using System.Threading.Tasks;
using OpenClaw.Shared;
+using WinRT;
using Windows.Devices.Enumeration;
using Windows.Media;
using Windows.Media.Audio;
@@ -396,7 +397,7 @@ private static bool TryCopyAudioFrame(AudioFrame frame, out byte[] bytes)
using var buffer = frame.LockBuffer(AudioBufferAccessMode.Read);
using var reference = buffer.CreateReference();
- var access = (IMemoryBufferByteAccess)reference;
+ var access = reference.As<IMemoryBufferByteAccess>();
access.GetBuffer(out var data, out var capacity);
if (data == IntPtr.Zero || capacity == 0)
From ee64cb9f38555b78a3216787bd59a08731540d32 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 21:01:40 +0000
Subject: [PATCH 50/83] Gate talk mode listening on capture readiness
---
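Note on the readiness gate: this patch makes `VoiceCaptureService` expose a task that completes once the first non-empty frame arrives, and replaces that task whenever capture restarts. The following is a minimal stand-alone sketch of the pattern under stated assumptions, not the patch's code. `CaptureReadinessGate`, `Reset`, and `MarkReady` are hypothetical names. The sketch assumes .NET 6 or later for `Task.WaitAsync(CancellationToken)`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-alone gate for illustration; not part of the patch.
public sealed class CaptureReadinessGate
{
    private readonly object _gate = new();
    private bool _ready;
    private TaskCompletionSource<bool> _tcs = NewTcs();

    // Call when a capture run (re)starts so new waiters get a fresh, incomplete task.
    public void Reset()
    {
        lock (_gate)
        {
            _ready = false;
            _tcs = NewTcs();
        }
    }

    // Call from the audio callback once the first non-empty frame has been read.
    public void MarkReady()
    {
        TaskCompletionSource<bool>? toComplete = null;

        lock (_gate)
        {
            if (!_ready)
            {
                _ready = true;
                toComplete = _tcs;
            }
        }

        // Complete outside the lock so the lock is never held while signalling.
        toComplete?.TrySetResult(true);
    }

    // Completes when capture is ready, or faults with OperationCanceledException
    // if the token is cancelled first.
    public Task WaitAsync(CancellationToken cancellationToken)
    {
        Task readiness;

        lock (_gate)
        {
            readiness = _ready ? Task.CompletedTask : _tcs.Task;
        }

        return readiness.WaitAsync(cancellationToken);
    }

    // RunContinuationsAsynchronously keeps awaiting code off the audio callback thread.
    private static TaskCompletionSource<bool> NewTcs() =>
        new(TaskCreationOptions.RunContinuationsAsynchronously);
}
```

In this sketch, a caller that started waiting before `Reset` keeps the old task, which then only ends through its cancellation token. Callers should therefore pass a token tied to the runtime lifetime.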
docs/VOICE-MODE.md | 2 +
.../Services/Voice/VoiceCaptureService.cs | 38 ++++++
.../Services/Voice/VoiceService.cs | 109 ++++++++++++------
3 files changed, 114 insertions(+), 35 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 4f33c06..0a41036 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -537,6 +537,7 @@ At runtime today:
- `Voice.OutputDeviceId` is applied to Talk Mode playback through `MediaPlayer.AudioDevice`
- `VoiceCaptureService` now runs an `AudioGraph` capture pipeline in parallel with Talk Mode and binds it to the selected or default microphone device
- `Voice.InputDeviceId` is now used by that `AudioGraph` capture path, but transcript generation still uses the Windows default speech input path until the STT adapter migration is complete
+- Talk Mode only advertises `ListeningContinuously` after the capture graph has produced live frames and the recognizer warm-up window has elapsed, so the status acts as a real “you can start talking now” signal instead of a timer-only guess
- when the system default capture device changes and Talk Mode is using the default mic, the recognizer is rebuilt so device switches such as AirPods are picked up without a full app restart
- explicit non-default microphone transcript generation is still pending the planned STT adapter migration
@@ -886,3 +887,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Updated the voice-mode architecture document with an accurate current Talk Mode flow, the planned `AudioGraph` input design, the STT adapter seam, and the selected-device roadmap.
- `2026-03-25` Added `VoiceCaptureService` on `AudioGraph`, wired it into Talk Mode lifecycle, and started using live capture signal as part of recognizer health and device-refresh handling while transcript generation still remains on the Windows speech recognizer.
- `2026-03-25` Fixed `VoiceCaptureService` frame-buffer interop for the current WinRT projection so AudioGraph quantum processing can read frame data instead of throwing invalid-cast warnings on every capture quantum.
+- `2026-03-25` Changed Talk Mode readiness so `Listening` only appears after the AudioGraph capture path is delivering frames and the recognizer warm-up completes, making it a user-facing “start talking now” signal rather than a timer-only state.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs
index 12ed036..e50b179 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCaptureService.cs
@@ -76,6 +76,8 @@ public sealed class VoiceCaptureService : IAsyncDisposable
private DeviceInformation? _activeCaptureDevice;
private int _sampleRateHz;
private int _channelCount;
+ private bool _captureReady;
+ private TaskCompletionSource<bool> _captureReadyTcs = CreateCaptureReadyTcs();
public VoiceCaptureService(IOpenClawLogger logger)
{
@@ -125,6 +127,12 @@ public async Task StartAsync(VoiceSettings settings, CancellationToken cancellat
await StopAsync();
cancellationToken.ThrowIfCancellationRequested();
+ lock (_gate)
+ {
+ _captureReady = false;
+ _captureReadyTcs = CreateCaptureReadyTcs();
+ }
+
AudioGraph? audioGraph = null;
AudioDeviceInputNode? deviceInputNode = null;
AudioFrameOutputNode? frameOutputNode = null;
@@ -256,6 +264,18 @@ public async Task StopAsync()
_logger.Info($"Voice capture graph stopped{(string.IsNullOrWhiteSpace(deviceName) ? string.Empty : $" ({deviceName})")}");
}
+ public Task WaitForCaptureReadyAsync(CancellationToken cancellationToken)
+ {
+ Task readinessTask;
+
+ lock (_gate)
+ {
+ readinessTask = _captureReady ? Task.CompletedTask : _captureReadyTcs.Task;
+ }
+
+ return readinessTask.WaitAsync(cancellationToken);
+ }
+
internal static uint ResolveDesiredSamplesPerQuantum(int sampleRateHz, int chunkMs)
{
if (sampleRateHz <= 0)
@@ -361,6 +381,19 @@ private void OnAudioGraphQuantumStarted(AudioGraph sender, object args)
return;
}
+ TaskCompletionSource<bool>? captureReadyTcs = null;
+
+ lock (_gate)
+ {
+ if (!_captureReady)
+ {
+ _captureReady = true;
+ captureReadyTcs = _captureReadyTcs;
+ }
+ }
+
+ captureReadyTcs?.TrySetResult(true);
+
var utcNow = DateTime.UtcNow;
var peak = ComputePeakLevel(bytes);
FrameCaptured?.Invoke(
@@ -410,6 +443,11 @@ private static bool TryCopyAudioFrame(AudioFrame frame, out byte[] bytes)
return true;
}
+ private static TaskCompletionSource<bool> CreateCaptureReadyTcs()
+ {
+ return new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+ }
+
[ComImport]
[Guid("5B0D3235-4DBA-4D44-865E-8F1D0E4FD04D")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index b58e9f3..cf27e94 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -550,23 +550,8 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
}
await EnsureChatTransportAsync(runtimeCts.Token);
- await StartRecognitionSessionAsync(updateListeningStatus: false);
+ await StartRecognitionSessionAsync();
ArmRecognitionHealthCheck();
- await Task.Delay(InitialRecognitionReadyDelay, runtimeCts.Token);
-
- lock (_gate)
- {
- if (_status.Running)
- {
- _status = BuildRunningStatus(
- VoiceActivationMode.TalkMode,
- effectiveSessionKey,
- VoiceRuntimeState.ListeningContinuously,
- fallbackMessage);
- }
- }
-
- _logger.Info($"Speech recognition warm-up completed ({InitialRecognitionReadyDelay.TotalMilliseconds:0}ms)");
_logger.Info("Voice runtime started in mode TalkMode");
}
catch
@@ -774,15 +759,81 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
- null);
+ VoiceRuntimeState.Arming,
+ _status.LastError);
}
}
_logger.Info("Speech recognition session started");
+ if (updateListeningStatus)
+ {
+ _ = MonitorListeningReadyAsync(generation, runtimeToken);
+ }
+
_ = MonitorRecognitionSessionHealthAsync(generation, runtimeToken);
}
+ private async Task MonitorListeningReadyAsync(int generation, CancellationToken cancellationToken)
+ {
+ try
+ {
+ VoiceCaptureService? captureService;
+
+ lock (_gate)
+ {
+ captureService = _voiceCaptureService;
+ }
+
+ if (captureService == null)
+ {
+ return;
+ }
+
+ await captureService.WaitForCaptureReadyAsync(cancellationToken);
+ await Task.Delay(InitialRecognitionReadyDelay, cancellationToken);
+
+ var transitionedToListening = false;
+
+ lock (_gate)
+ {
+ if (_runtimeCts == null ||
+ _runtimeCts.IsCancellationRequested ||
+ !_status.Running ||
+ _status.Mode != VoiceActivationMode.TalkMode ||
+ !_recognitionActive ||
+ _recognitionSessionGeneration != generation ||
+ _awaitingReply ||
+ _isSpeaking)
+ {
+ return;
+ }
+
+ if (_status.State != VoiceRuntimeState.ListeningContinuously)
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.ListeningContinuously,
+ _status.LastError);
+ transitionedToListening = true;
+ }
+ }
+
+ if (transitionedToListening)
+ {
+ _logger.Info(
+ $"Speech pipeline ready; capture frames observed and recognizer warm-up completed ({InitialRecognitionReadyDelay.TotalMilliseconds:0}ms)");
+ }
+ }
+ catch (OperationCanceledException)
+ {
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice listening readiness check failed: {ex.Message}");
+ }
+ }
+
private async Task ResumeRecognitionSessionAsync(
CancellationToken cancellationToken,
string reason,
@@ -1118,7 +1169,7 @@ private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
+ VoiceRuntimeState.Arming,
"Timed out waiting for an assistant reply.");
shouldResume = true;
}
@@ -1201,7 +1252,7 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
+ VoiceRuntimeState.Arming,
_status.LastError);
shouldResumeRecognition = true;
}
@@ -1276,7 +1327,7 @@ private async Task ProcessQueuedAssistantRepliesAsync()
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
+ VoiceRuntimeState.Arming,
_status.LastError);
}
@@ -1311,7 +1362,7 @@ private async Task ProcessQueuedAssistantRepliesAsync()
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
- shouldPauseBeforeNextReply ? VoiceRuntimeState.PlayingResponse : VoiceRuntimeState.ListeningContinuously,
+ shouldPauseBeforeNextReply ? VoiceRuntimeState.PlayingResponse : VoiceRuntimeState.Arming,
GetUserFacingErrorMessage(ex));
}
}
@@ -1349,7 +1400,7 @@ private async Task ProcessQueuedAssistantRepliesAsync()
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
_status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
+ VoiceRuntimeState.Arming,
_status.LastError);
}
}
@@ -1649,18 +1700,6 @@ private void OnChatTransportStatusChanged(object? sender, ConnectionStatus statu
if (status == ConnectionStatus.Connected)
{
_transportReadyTcs?.TrySetResult(true);
-
- if (_status.Running &&
- _status.Mode == VoiceActivationMode.TalkMode &&
- !_awaitingReply &&
- !_isSpeaking)
- {
- _status = BuildRunningStatus(
- VoiceActivationMode.TalkMode,
- _status.SessionKey,
- VoiceRuntimeState.ListeningContinuously,
- _status.LastError);
- }
}
else if (status == ConnectionStatus.Error)
{
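Reviewer note (not part of the patch): the capture-readiness gate added to `VoiceCaptureService` above can be read in isolation as the small sketch below. The class and member names (`ReadinessGateSketch`, `Reset`, `SignalReady`) are hypothetical; only the pattern (a `TaskCompletionSource<bool>` created with `RunContinuationsAsynchronously`, swapped under a lock and completed outside it) is taken from the hunks. It assumes .NET 6 or later for `Task.WaitAsync`.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-alone illustration of the readiness gate pattern.
public sealed class ReadinessGateSketch
{
    private readonly object _gate = new();
    private bool _ready;
    private TaskCompletionSource<bool> _tcs = NewTcs();

    // Called at the start of each capture session: forget the old signal.
    public void Reset()
    {
        lock (_gate)
        {
            _ready = false;
            _tcs = NewTcs();
        }
    }

    // Called from the frame callback; only the first call per session completes the task.
    public void SignalReady()
    {
        TaskCompletionSource<bool>? toComplete = null;

        lock (_gate)
        {
            if (!_ready)
            {
                _ready = true;
                toComplete = _tcs;
            }
        }

        // Complete outside the lock so no waiter code runs while the gate is held.
        toComplete?.TrySetResult(true);
    }

    // Waiters either return immediately or observe the current session's task.
    public Task WaitAsync(CancellationToken cancellationToken)
    {
        Task task;

        lock (_gate)
        {
            task = _ready ? Task.CompletedTask : _tcs.Task;
        }

        return task.WaitAsync(cancellationToken);
    }

    private static TaskCompletionSource<bool> NewTcs() =>
        new(TaskCreationOptions.RunContinuationsAsynchronously);
}
```

Creating the source with `RunContinuationsAsynchronously` keeps awaiting code off the audio callback thread, which is presumably why the patch does the same.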
From cc1ab5c956afe00cbd2a850e5ad336f2ee65408d Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 21:06:01 +0000
Subject: [PATCH 51/83] Document macOS voice parity backlog
---
docs/VOICE-MODE.md | 166 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 166 insertions(+)
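Reviewer note (not part of the patch): the "Voice directives in replies" story added below asks for parsing only the first non-empty reply line and stripping the directive before playback. A minimal sketch of that split follows, assuming the directive is a single-line JSON object; the helper name `VoiceDirectiveSketch.Split` and its return shape are hypothetical, and key handling (`voice`, `model`, `once`) is deliberately left out.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper; nothing in the patch series defines this type.
public static class VoiceDirectiveSketch
{
    // Returns the parsed directive (if any) and the text that should be spoken.
    public static (JsonElement? Directive, string SpokenText) Split(string reply)
    {
        var lines = reply.Split('\n');

        for (var i = 0; i < lines.Length; i++)
        {
            var line = lines[i].Trim();
            if (line.Length == 0)
            {
                continue; // skip leading blank lines
            }

            // Only the first non-empty line may carry a directive.
            if (line.StartsWith('{') && TryParseObject(line, out var directive))
            {
                var rest = string.Join('\n', lines, i + 1, lines.Length - i - 1);
                return (directive, rest.TrimStart('\r', '\n'));
            }

            break;
        }

        // No directive found: speak the reply unchanged.
        return (null, reply);
    }

    private static bool TryParseObject(string json, out JsonElement element)
    {
        try
        {
            using var document = JsonDocument.Parse(json);
            element = document.RootElement.Clone();
            return element.ValueKind == JsonValueKind.Object;
        }
        catch (JsonException)
        {
            element = default;
            return false;
        }
    }
}
```

Validation of individual keys would then sit behind the provider contract layer, as the story's notes suggest.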
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 0a41036..f25ba4d 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -778,8 +778,92 @@ The Windows node still keeps provider choice bounded:
This keeps the provider surface narrow while still meeting the required MiniMax/ElevenLabs support direction.
+## Parity with macOS Node
+
+Status values used below:
+
+- `Supported`
+- `Partial`
+- `NotSupported (planned)`
+- `Exceeded*`
+
+| macOS feature | Current Windows state | Notes |
+|---|---|---|
+| Talk Mode continuous loop (`listen -> chat.send(main) -> wait -> speak`) | `Supported` | Windows Talk Mode uses direct `chat.send` on the active main session and loops back to listening after reply playback. |
+| Talk Mode sends after a short silence window | `Supported` | The current runtime finalizes on recognition pause and uses configurable Talk Mode silence settings. |
+| Talk Mode visible phase transitions (`Listening -> Thinking -> Speaking`) | `Partial` | Runtime states and tray icon changes exist, but there is no always-visible overlay yet. |
+| Talk Mode always-on overlay with click-to-stop / click-X controls | `NotSupported (planned)` | Windows currently has a tray icon, status window, and draft mirroring, but no overlay surface. |
+| Talk Mode writes replies into WebChat the same way typed chat does | `Partial` | Replies appear in WebChat through normal session updates, but Talk Mode uses direct send rather than a same-as-typing transport path. |
+| Talk Mode interrupt-on-speech / barge-in | `NotSupported (planned)` | Windows is still half-duplex during reply playback. |
+| Talk Mode voice directives in replies | `NotSupported (planned)` | Windows does not yet parse or apply the JSON voice directive line described in the Talk Mode docs. |
+| Talk Mode true streaming TTS playback | `NotSupported (planned)` | MiniMax uses WebSocket transport, but playback still waits for a complete playable stream. |
+| Talk Mode cloud TTS provider flexibility | `Exceeded*` | Windows already supports Windows built-in TTS plus catalog-driven cloud providers rather than being limited to a single provider path.[^parity-tts] |
+| Voice Wake wake-word runtime | `NotSupported (planned)` | `VoiceWake` remains a documented target mode, but there is no active wake-word runtime yet. |
+| Voice Wake push-to-talk capture | `NotSupported (planned)` | There is no Windows push-to-talk path yet. |
+| Voice Wake overlay with committed / volatile transcript states | `NotSupported (planned)` | No Voice Wake overlay exists on Windows yet. |
+| Voice Wake restart invariants when UI is dismissed | `NotSupported (planned)` | The macOS overlay-dismiss resilience behavior has no Windows equivalent yet because the overlay/runtime does not exist. |
+| Voice Wake forwarding to the active gateway / agent | `NotSupported (planned)` | Forwarding semantics are only implemented for Talk Mode today. |
+| Voice Wake machine-hint transcript prefixing | `NotSupported (planned)` | Windows does not currently prepend a machine hint on forwarded wake transcripts. |
+| Voice Wake mic picker, live level meter, trigger-word table, and tester | `NotSupported (planned)` | Windows has general voice settings and device lists, but not the Voice Wake-specific settings surface from macOS. |
+| Voice mic device selection | `Partial` | Selected output device is implemented; selected microphone binding exists in `AudioGraph`, but actual transcript generation still follows the Windows speech-input path. |
+| Voice Wake send / trigger chimes | `NotSupported (planned)` | Windows currently has no configurable trigger/send sounds. |
+
+[^parity-tts]: Windows supports provider-catalog TTS contracts with Windows built-in, MiniMax, and ElevenLabs entries today, whereas the documented macOS baseline is ElevenLabs-centric. Windows does not yet exceed macOS on true streaming playback latency because incremental playback is still pending.
+
## Feature List (Backlog)
+### Story: True selected-microphone transcription support
+
+Make actual STT transcription follow the selected microphone device, not just the `AudioGraph` capture path.
+
+Notes:
+
+- current Windows Talk Mode capture can bind to the selected mic through `VoiceCaptureService`
+- final transcript generation still follows the Windows speech-input path rather than the selected device id
+- the implementation should complete the planned `AudioGraph` -> `ISpeechToTextAdapter` migration so the chosen microphone controls the whole input pipeline
+
+### Story: Talk Mode overlay and visible phase parity
+
+Add a Talk Mode overlay that makes `Listening`, `Thinking`, and `Speaking` visible to the user in the same way the macOS experience does.
+
+Notes:
+
+- the current tray icon and status window are not equivalent to an always-visible Talk Mode surface
+- the overlay should expose phase transitions clearly and support later stop / dismiss controls
+- this should be designed alongside the existing compact voice-strip idea so the two UI surfaces do not conflict
+
+### Story: Talk Mode overlay controls
+
+Add explicit Talk Mode overlay controls for stopping speech playback and exiting Talk Mode.
+
+Notes:
+
+- macOS exposes click-to-stop and click-to-exit controls directly on the overlay
+- Windows currently requires tray or settings interaction instead
+- this should plug into the shared runtime control API rather than directly manipulating `VoiceService`
+
+### Story: Same-as-typing WebChat parity for Talk Mode
+
+Decide whether Windows Talk Mode should optionally route sent transcripts through a typed-chat equivalent path so WebChat behavior matches manual typing more closely.
+
+Notes:
+
+- current Windows Talk Mode intentionally uses direct `chat.send`
+- replies still appear in WebChat through session updates, but the send path is not literally the same as typed WebChat submission
+- this should only be revisited if the memory / prompt-shaping issues can be fixed without reintroducing transport fragility
+
+### Story: Voice directives in replies
+
+Support the Talk Mode reply-prefix JSON directive described in the OpenClaw docs.
+
+Notes:
+
+- parse only the first non-empty reply line
+- strip the directive before playback
+- support per-reply `once: true` and persistent default updates
+- supported keys should at least include voice, model, and the documented voice-shaping parameters
+- provider-specific validation should happen through the provider contract layer where possible
+
### Story: Support non-local (or non-Windows, local) STT providers
Allow the user to select a non-local STT provider like OpenAI Whisper, or a local non-Windows recognizer, instead of being locked to the Windows built-in path.
@@ -805,6 +889,88 @@ Notes:
- additional runtime control/status so the UI can show when barge-in is armed
- this should be treated as a separate engineering phase, not a small extension of the current Talk Mode runtime
+### Story: Voice Wake wake-word runtime
+
+Implement the actual Windows Voice Wake runtime.
+
+Notes:
+
+- this should cover wake-word listening, trigger detection, post-trigger capture, silence finalization, hard-stop protection, and debounce between sessions
+- the runtime should restart cleanly after send and should remain armed whenever Voice Wake is enabled and permissions are available
+- the implementation should be based on the planned `AudioGraph` capture pipeline rather than a second unrelated microphone stack
+
+### Story: Voice Wake push-to-talk
+
+Implement a Windows push-to-talk capture path alongside wake-word activation.
+
+Notes:
+
+- this should support press-to-capture, release-to-finalize semantics
+- it should pause the wake runtime while push-to-talk capture is active, then resume it cleanly afterward
+- Windows-specific hotkey and permissions behavior should be documented explicitly once chosen
+
+### Story: Voice Wake overlay lifecycle
+
+Add the Windows Voice Wake overlay and harden its lifecycle so dismissing or hiding the UI can never leave the wake runtime dead.
+
+Notes:
+
+- support committed and volatile transcript presentation
+- manual dismiss must never block recognizer restart
+- overlay and runtime should be coordinated through a session controller rather than direct UI coupling
+
+### Story: Voice Wake settings parity
+
+Add the user-facing Voice Wake settings surface that exists on macOS.
+
+Notes:
+
+- include language and mic pickers
+- include a live level meter
+- include trigger-word editing or table management
+- include a local-only tester that does not forward
+- preserve the chosen mic if it disconnects, surface a disconnected hint, and fall back to the system default until it returns
+
+### Story: Voice Wake sounds and chimes
+
+Add configurable trigger and send sounds for Voice Wake.
+
+Notes:
+
+- trigger and send events should be independently configurable
+- support `No Sound`
+- keep the sound implementation distinct from assistant reply playback
+
+### Story: Voice Wake forwarding semantics
+
+Implement the documented Voice Wake forwarding behavior.
+
+Notes:
+
+- forwarded transcripts should go to the active gateway / agent path
+- reply delivery and logging behavior should match the rest of the node session model
+- the forwarding path should be resilient even when UI surfaces are closed
+
+### Story: Voice Wake machine-hint prefixing
+
+Implement the documented transcript prefixing / machine-hint behavior for forwarded Voice Wake utterances.
+
+Notes:
+
+- the prefixing rule should be explicit and testable
+- both wake-word and push-to-talk paths should share the same forwarding helper
+
+### Story: Voice Wake trigger tuning and pause semantics
+
+Implement the documented Voice Wake trigger-gap, silence-window, hard-stop, and debounce semantics.
+
+Notes:
+
+- include the wake-word gap behavior before command capture begins
+- support distinct silence windows for trigger-only vs flowing speech cases
+- include a hard maximum capture duration
+- expose the tuning through voice settings rather than hard-coded constants alone
+
### Story: Compact Voice Status Strip
From a3059027a925c7277e815ecdbeaa2ea84c2d0416 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 21:17:38 +0000
Subject: [PATCH 52/83] Inline parity exceeded notes
---
docs/VOICE-MODE.md | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index f25ba4d..83f7e15 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -10,13 +10,15 @@ This document defines the voice subsystem for the Windows node only. It introduc
- Present the user-facing mode names as `Voice Wake` and `Talk Mode`
- Keep STT/TTS provider selection configurable, with Windows implementations as the default built-ins
- Implement `MiniMax` TTS and `ElevenLabs` TTS as required non-Windows providers after the Windows baseline
+- Make adding new voice providers an update to a JSON catalog, rather than requiring code changes
- Reuse the existing node capability pattern instead of introducing a parallel control path
+- Ensure that the voice sub-system is extensible
+- Ensure that the voice sub-system is controllable from other applications
## Non-Goals
- True full-duplex or chunk-streaming audio transport between node and gateway
-- Arbitrary provider proliferation before the required `MiniMax` / `ElevenLabs` TTS support is in place
-- Changes to unrelated project documentation
+- Substantial changes to the existing project
## Design Position
@@ -34,7 +36,7 @@ This keeps the Windows node lean for the first implementation and avoids introdu
## Visible Mode Names
-The tray app now uses user-facing names rather than exposing the internal enum names directly:
+The tray app now uses user-facing names (borrowed from the macOS app) rather than exposing the internal enum names directly:
| Internal Mode | Visible Name | Availability |
|---|---|---|
@@ -49,19 +51,19 @@ The contracts and persisted settings now use `VoiceWake` and `TalkMode` as well.
`TalkMode` follows the current talk-mode style control flow:
- the node captures audio locally
-- local speech recognition turns that audio into transcript text
+- local or remote speech recognition turns that audio into transcript text
- interim hypotheses are surfaced live, but only final `Medium` or `High` confidence recognizer results are submitted
- the tray chat window, when open, mirrors the live transcript draft locally
- the finalized transcript is always sent to OpenClaw via direct `chat.send` on the main session
- OpenClaw returns the assistant reply as normal chat output
-- the node performs local TTS playback of that reply
-- assistant replies are queued locally and spoken sequentially, with a short 500 ms pause between queued replies so overlapping responses are not lost
+- the node performs local or remote TTS playback of that reply
+- assistant replies are queued locally and spoken sequentially, with a short (500 ms currently) pause between queued replies so overlapping responses are not lost
- if a reply arrives after the normal 45-second wait timeout, the tray still accepts and speaks that late reply for a short bounded grace window so slow upstream responses are not silently lost
- the tray chat window can optionally strip injected `<relevant-memories>...</relevant-memories>` blocks from the rendered display without changing the underlying upstream message
To avoid obvious duplicate sends from the Windows recognizer, exact duplicate final transcripts are suppressed within a short 750 ms window.
-That means the first Windows target is transcript transport, not raw audio upload. Streaming audio frames in or out of OpenClaw remains a future protocol extension, not part of this design.
+That means the first Windows target is transcript transport, not raw audio upload. Streaming audio frames in or out of OpenClaw remains a future protocol extension, and is therefore not part of this design.
The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection exists to carry assistant chat events for `TalkMode`, and to provide a fallback direct `chat.send` path when the tray chat window is not open.
@@ -155,7 +157,7 @@ The main remaining gap is streaming playback from the first audio chunk. The Azu
- MiniMax now uses the provider catalog's WebSocket TTS contract, but the current player still waits for a complete playable stream before output starts
- ElevenLabs is currently integrated through the non-streaming convert contract in the provider catalog
-So the current design minimizes avoidable setup and connection latency, but does not yet implement first-chunk playback streaming.
+So the current design minimizes avoidable setup and connection latency, but does not yet implement first-chunk playback streaming. This is, however, planned for an early release.
## Tray Chat Integration Decision
@@ -202,11 +204,14 @@ The embedded [WebChatWindow.xaml.cs](../src/OpenClaw.Tray.WinUI/Windows/WebChatW
This is intentionally a tray-local integration decision, not a protocol-level rewrite of the stored upstream transcript.
+It also fits with the planned voice mode *repeater form*, which will act as an optional small display and control surface whilst voice mode is in operation.
+
### Tradeoffs
- preserves a single visible conversation for the user
- avoids a second voice-only session in the tray UI
- uses only one send path for voice turns, which is simpler to reason about and debug
+- requires us to change the existing project, but not too significantly
- keeps a light DOM integration inside the embedded WebView chat surface for draft mirroring only
- only affects the tray app chat window; other clients still render upstream content according to their own rules
@@ -223,9 +228,9 @@ Runtime behavior in the current phase:
- `windows` is implemented for both STT and TTS
- built-in catalog entries exist for both `minimax` and `elevenlabs` TTS
-- `minimax` defaults to `speech-2.8-turbo` and `English_MatureBoss`
+- `minimax` defaults to `speech-2.8-turbo` and `English_MatureBoss` at present
- `minimax` now uses a catalog-driven WebSocket contract for synchronous TTS
-- `elevenlabs` defaults to `eleven_multilingual_v2` and a user-supplied voice id
+- `elevenlabs` defaults to `eleven_multilingual_v2` and voice id `6aDn1KB0hjpdcocrUkmq` (Tiffany) for now
- non-Windows providers can be selected and persisted now
- unsupported providers fall back to Windows at runtime with a status warning
@@ -797,7 +802,7 @@ Status values used below:
| Talk Mode interrupt-on-speech / barge-in | `NotSupported (planned)` | Windows is still half-duplex during reply playback. |
| Talk Mode voice directives in replies | `NotSupported (planned)` | Windows does not yet parse or apply the JSON voice directive line described in the Talk Mode docs. |
| Talk Mode true streaming TTS playback | `NotSupported (planned)` | MiniMax uses WebSocket transport, but playback still waits for a complete playable stream. |
-| Talk Mode cloud TTS provider flexibility | `Exceeded*` | Windows already supports Windows built-in TTS plus catalog-driven cloud providers rather than being limited to a single provider path.[^parity-tts] |
+| Talk Mode cloud TTS provider flexibility | `Exceeded` | Windows already supports Windows built-in TTS plus catalog-driven cloud providers rather than being limited to a single provider path. This exceeds the documented macOS baseline on provider flexibility, but not yet on true streaming playback latency because incremental playback is still pending. |
| Voice Wake wake-word runtime | `NotSupported (planned)` | `VoiceWake` remains a documented target mode, but there is no active wake-word runtime yet. |
| Voice Wake push-to-talk capture | `NotSupported (planned)` | There is no Windows push-to-talk path yet. |
| Voice Wake overlay with committed / volatile transcript states | `NotSupported (planned)` | No Voice Wake overlay exists on Windows yet. |
@@ -808,8 +813,6 @@ Status values used below:
| Voice mic device selection | `Partial` | Selected output device is implemented; selected microphone binding exists in `AudioGraph`, but actual transcript generation still follows the Windows speech-input path. |
| Voice Wake send / trigger chimes | `NotSupported (planned)` | Windows currently has no configurable trigger/send sounds. |
-[^parity-tts]: Windows supports provider-catalog TTS contracts with Windows built-in, MiniMax, and ElevenLabs entries today, whereas the documented macOS baseline is ElevenLabs-centric. Windows does not yet exceed macOS on true streaming playback latency because incremental playback is still pending.
-
## Feature List (Backlog)
### Story: True selected-microphone transcription support
From 5ca4af28a049de7968ce7be8f2571767f3909a69 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 21:33:17 +0000
Subject: [PATCH 53/83] Stop recycling talk mode on silence
---
docs/VOICE-MODE.md | 2 +
.../Services/Voice/VoiceService.cs | 71 +++++++++++++++----
.../VoiceServiceTransportTests.cs | 38 ++++++++++
3 files changed, 98 insertions(+), 13 deletions(-)
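Reviewer note (not part of the patch): the speech-triggered recovery described in this commit can be pictured with the sketch below. It is an illustration only, not the body of the patch's helpers: the class `SpeechBurstSketch` is hypothetical, and resetting the counter on a quiet frame is an assumption. The two constants mirror `RecognitionSignalBurstThreshold` and `RecognitionSignalPeakThreshold` from the diff.

```csharp
using System;

// Hypothetical illustration; the real logic lives in VoiceService.
public sealed class SpeechBurstSketch
{
    private const int BurstThreshold = 4;      // mirrors RecognitionSignalBurstThreshold
    private const float PeakThreshold = 0.03f; // mirrors RecognitionSignalPeakThreshold

    private int _burstCount;

    // Returns true once enough loud frames arrive in a row with no recognizer activity.
    public bool OnFrame(float peakLevel, bool recognizerHadActivity)
    {
        // Silence, or any recognizer activity, clears the counter (assumed behavior),
        // so a quiet room can never trigger a recognizer recycle.
        if (recognizerHadActivity || peakLevel < PeakThreshold)
        {
            _burstCount = 0;
            return false;
        }

        _burstCount++;
        return _burstCount >= BurstThreshold;
    }
}
```

In the patch, a positive result appears to arm a check that waits `RecognitionSpeechMismatchDelay` (2 seconds) before recycling, rather than recycling immediately.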
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 83f7e15..cfec33b 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -543,6 +543,7 @@ At runtime today:
- `VoiceCaptureService` now runs an `AudioGraph` capture pipeline in parallel with Talk Mode and binds it to the selected or default microphone device
- `Voice.InputDeviceId` is now used by that `AudioGraph` capture path, but transcript generation still uses the Windows default speech input path until the STT adapter migration is complete
- Talk Mode only advertises `ListeningContinuously` after the capture graph has produced live frames and the recognizer warm-up window has elapsed, so the status acts as a real “you can start talking now” signal instead of a timer-only guess
+- recognizer recovery is now speech-triggered rather than silence-triggered: the Windows recognizer is only recycled when sustained capture speech is present but no recognition activity follows
- when the system default capture device changes and Talk Mode is using the default mic, the recognizer is rebuilt so device switches such as AirPods are picked up without a full app restart
- explicit non-default microphone transcript generation is still pending the planned STT adapter migration
@@ -1057,3 +1058,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Added `VoiceCaptureService` on `AudioGraph`, wired it into Talk Mode lifecycle, and started using live capture signal as part of recognizer health and device-refresh handling while transcript generation still remains on the Windows speech recognizer.
- `2026-03-25` Fixed `VoiceCaptureService` frame-buffer interop for the current WinRT projection so AudioGraph quantum processing can read frame data instead of throwing invalid-cast warnings on every capture quantum.
- `2026-03-25` Changed Talk Mode readiness so `Listening` only appears after the AudioGraph capture path is delivering frames and the recognizer warm-up completes, making it a user-facing “start talking now” signal rather than a timer-only state.
+- `2026-03-25` Changed recognizer recovery so silence no longer churns the Talk Mode pipeline; recycling now only happens when sustained capture speech is present but the Windows recognizer still produces no activity.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index cf27e94..c88bbf3 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -30,11 +30,13 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
private static readonly TimeSpan LateReplyGraceWindow = TimeSpan.FromMinutes(2);
private static readonly TimeSpan InitialRecognitionReadyDelay = TimeSpan.FromMilliseconds(500);
- private static readonly TimeSpan RecognitionHealthCheckDelay = TimeSpan.FromSeconds(15);
+ private static readonly TimeSpan RecognitionSpeechMismatchDelay = TimeSpan.FromSeconds(2);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
private static readonly TimeSpan HypothesisPromotionWindow = TimeSpan.FromSeconds(2);
private static readonly TimeSpan RecognitionResumeRetryDelay = TimeSpan.FromMilliseconds(500);
private static readonly TimeSpan QueuedReplyPlaybackGap = TimeSpan.FromMilliseconds(500);
+ private const int RecognitionSignalBurstThreshold = 4;
+ private const float RecognitionSignalPeakThreshold = 0.03f;
private readonly IOpenClawLogger _logger;
private readonly SettingsManager _settings;
@@ -57,6 +59,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private bool _recognitionSessionHadCaptureSignal;
private bool _recognitionHealthCheckArmed;
private bool _recognitionRestartInProgress;
+ private int _recognitionSignalBurstCount;
private bool _awaitingReply;
private bool _isSpeaking;
private bool _replyPlaybackLoopActive;
@@ -551,7 +554,6 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
await EnsureChatTransportAsync(runtimeCts.Token);
await StartRecognitionSessionAsync();
- ArmRecognitionHealthCheck();
_logger.Info("Voice runtime started in mode TalkMode");
}
catch
@@ -743,6 +745,8 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
runtimeToken = _runtimeCts.Token;
generation = ++_recognitionSessionGeneration;
_recognitionSessionHadCaptureSignal = false;
+ _recognitionSignalBurstCount = 0;
+ _recognitionHealthCheckArmed = false;
}
_logger.Info("Starting speech recognition session");
@@ -910,6 +914,7 @@ private async Task StopRecognitionSessionAsync()
_recognitionActive = false;
_recognitionHealthCheckArmed = false;
_recognitionSessionHadCaptureSignal = false;
+ _recognitionSignalBurstCount = 0;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
}
@@ -1024,6 +1029,7 @@ private void OnSpeechHypothesisGenerated(SpeechRecognizer sender, SpeechRecognit
sessionKey = GetCurrentVoiceSessionKey();
_recognitionSessionHadActivity = true;
_recognitionHealthCheckArmed = false;
+ _recognitionSignalBurstCount = 0;
_lastHypothesisText = text;
_lastHypothesisUtc = DateTime.UtcNow;
if (_status.State != VoiceRuntimeState.RecordingUtterance)
@@ -1077,6 +1083,7 @@ private async Task HandleRecognizedTextAsync(string text)
_recognitionSessionHadActivity = true;
_recognitionHealthCheckArmed = false;
_recognitionSessionHadCaptureSignal = false;
+ _recognitionSignalBurstCount = 0;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
cancellationToken = _runtimeCts.Token;
@@ -1642,6 +1649,7 @@ private async void OnSpeechRecognitionCompleted(
sessionHadCaptureSignal = _recognitionSessionHadCaptureSignal;
_recognitionSessionHadActivity = false;
_recognitionSessionHadCaptureSignal = false;
+ _recognitionSignalBurstCount = 0;
restartInProgress = _recognitionRestartInProgress;
if (restartInProgress)
{
@@ -1650,9 +1658,7 @@ private async void OnSpeechRecognitionCompleted(
}
else
{
- _recognitionHealthCheckArmed =
- args.Status == SpeechRecognitionResultStatus.UserCanceled ||
- args.Status == SpeechRecognitionResultStatus.TimeoutExceeded;
+ _recognitionHealthCheckArmed = false;
}
token = _runtimeCts.Token;
shouldRestart = ShouldRestartRecognitionAfterCompletion(
@@ -1758,6 +1764,7 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_recognitionActive = false;
_recognitionSessionHadActivity = false;
_recognitionSessionHadCaptureSignal = false;
+ _recognitionSignalBurstCount = 0;
_recognitionRestartInProgress = false;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
@@ -1996,16 +2003,27 @@ private async void OnDefaultAudioCaptureDeviceChanged(object sender, DefaultAudi
}
}
- private void ArmRecognitionHealthCheck()
+ internal static bool ShouldTreatCaptureSignalAsSpeech(float peakLevel)
{
- lock (_gate)
- {
- _recognitionHealthCheckArmed = true;
- }
+ return peakLevel >= RecognitionSignalPeakThreshold;
+ }
+
+ internal static bool ShouldArmRecognitionRecoveryAfterCaptureSignal(
+ bool recognitionSessionHadActivity,
+ bool recognitionHealthCheckArmed,
+ int recognitionSignalBurstCount)
+ {
+ return !recognitionSessionHadActivity &&
+ !recognitionHealthCheckArmed &&
+ recognitionSignalBurstCount >= RecognitionSignalBurstThreshold;
}
private void OnCaptureSignalDetected(object? sender, VoiceCaptureSignalEventArgs args)
{
+ var shouldStartRecoveryWatchdog = false;
+ var generation = 0;
+ CancellationToken cancellationToken = default;
+
lock (_gate)
{
if (_runtimeCts == null ||
@@ -2018,7 +2036,29 @@ private void OnCaptureSignalDetected(object? sender, VoiceCaptureSignalEventArgs
return;
}
+ if (!ShouldTreatCaptureSignalAsSpeech(args.PeakLevel))
+ {
+ return;
+ }
+
_recognitionSessionHadCaptureSignal = true;
+ _recognitionSignalBurstCount++;
+
+ if (ShouldArmRecognitionRecoveryAfterCaptureSignal(
+ _recognitionSessionHadActivity,
+ _recognitionHealthCheckArmed,
+ _recognitionSignalBurstCount))
+ {
+ _recognitionHealthCheckArmed = true;
+ generation = _recognitionSessionGeneration;
+ cancellationToken = _runtimeCts.Token;
+ shouldStartRecoveryWatchdog = true;
+ }
+ }
+
+ if (shouldStartRecoveryWatchdog)
+ {
+ _ = MonitorRecognitionSessionHealthAsync(generation, cancellationToken);
}
}
@@ -2041,6 +2081,7 @@ private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken
_recognitionActive = false;
_recognitionSessionHadActivity = false;
_recognitionSessionHadCaptureSignal = false;
+ _recognitionSignalBurstCount = 0;
_recognitionHealthCheckArmed = false;
}
@@ -2101,6 +2142,7 @@ private async Task RebuildVoiceCaptureAsync(string reason, CancellationToken can
captureService = _voiceCaptureService;
settings = Clone(_settings.Voice);
_recognitionSessionHadCaptureSignal = false;
+ _recognitionSignalBurstCount = 0;
}
if (captureService == null)
@@ -2117,15 +2159,18 @@ private async Task MonitorRecognitionSessionHealthAsync(int generation, Cancella
{
try
{
- await Task.Delay(RecognitionHealthCheckDelay, cancellationToken);
+ await Task.Delay(RecognitionSpeechMismatchDelay, cancellationToken);
var shouldRecycle = false;
var sawCaptureSignal = false;
+ var signalBurstCount = 0;
lock (_gate)
{
sawCaptureSignal = _recognitionSessionHadCaptureSignal;
+ signalBurstCount = _recognitionSignalBurstCount;
shouldRecycle =
_recognitionHealthCheckArmed &&
+ sawCaptureSignal &&
_recognitionActive &&
_runtimeCts != null &&
!_runtimeCts.IsCancellationRequested &&
@@ -2152,11 +2197,11 @@ private async Task MonitorRecognitionSessionHealthAsync(int generation, Cancella
}
_logger.Warn(
- $"Speech recognition session produced no hypotheses/results within {RecognitionHealthCheckDelay.TotalSeconds:0}s; recycling session (captureSignal={sawCaptureSignal})");
+ $"Speech recognizer heard sustained capture audio but produced no recognition activity within {RecognitionSpeechMismatchDelay.TotalSeconds:0}s; recycling session (captureSignal={sawCaptureSignal}, signalBursts={signalBurstCount})");
await StopRecognitionSessionAsync();
await ResumeRecognitionSessionAsync(
cancellationToken,
- "recognition health check",
+ "capture signal without recognition activity",
rebuildRecognizer: sawCaptureSignal);
}
catch (OperationCanceledException)
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index 0916234..2b8fa33 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -180,6 +180,44 @@ public void ShouldRebuildRecognitionAfterCompletion_OnlyRebuildsForDeafCanceledS
Assert.Equal(expected, result);
}
+ [Theory]
+ [InlineData(0.029f, false)]
+ [InlineData(0.03f, true)]
+ [InlineData(0.08f, true)]
+ public void ShouldTreatCaptureSignalAsSpeech_RequiresSpeechLikePeak(float peakLevel, bool expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "ShouldTreatCaptureSignalAsSpeech",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (bool)method.Invoke(null, [peakLevel])!;
+
+ Assert.Equal(expected, result);
+ }
+
+ [Theory]
+ [InlineData(false, false, 1, false)]
+ [InlineData(false, false, 3, false)]
+ [InlineData(false, false, 4, true)]
+ [InlineData(true, false, 4, false)]
+ [InlineData(false, true, 4, false)]
+ public void ShouldArmRecognitionRecoveryAfterCaptureSignal_RequiresBurstWithoutRecognitionActivity(
+ bool recognitionSessionHadActivity,
+ bool recognitionHealthCheckArmed,
+ int recognitionSignalBurstCount,
+ bool expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "ShouldArmRecognitionRecoveryAfterCaptureSignal",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (bool)method.Invoke(
+ null,
+ [recognitionSessionHadActivity, recognitionHealthCheckArmed, recognitionSignalBurstCount])!;
+
+ Assert.Equal(expected, result);
+ }
+
[Theory]
[InlineData(16000, 80, 1280)]
[InlineData(16000, 0, 1280)]
From bcdaca84fe632e4fa122e855b1f25e52f6dce5c0 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 21:42:10 +0000
Subject: [PATCH 54/83] Delay deaf recognizer recovery until silence
---
docs/VOICE-MODE.md | 2 +
.../Services/Voice/VoiceService.cs | 141 ++++++++++++++----
.../VoiceServiceTransportTests.cs | 41 +++++
3 files changed, 157 insertions(+), 27 deletions(-)
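Reviewer note: the two pure helpers this patch adds can be exercised in isolation. The following is a minimal stand-alone sketch, not the `VoiceService` code itself. The class name `RecycleDelaySketch` and the method names are hypothetical; the 750 ms silence window, the 50 ms floor, and the 2 s hypothesis window are copied from the constants in this patch.

```csharp
using System;

// Hypothetical stand-alone sketch; it mirrors the patch's helpers but is not part of VoiceService.
public static class RecycleDelaySketch
{
    // Assumed equal to RecognitionPostSpeechSilenceBeforeRecycle (750 ms in this patch).
    public static readonly TimeSpan PostSpeechSilence = TimeSpan.FromMilliseconds(750);

    // Floor used by the watchdog loop so it never waits for a near-zero delay (50 ms in this patch).
    public static readonly TimeSpan MinimumWait = TimeSpan.FromMilliseconds(50);

    // Assumed equal to HypothesisPromotionWindow (2 s).
    public static readonly TimeSpan HypothesisWindow = TimeSpan.FromSeconds(2);

    // True while the last speech-like capture signal is newer than the silence window.
    public static bool ShouldDelay(DateTime lastSignalUtc, DateTime utcNow) =>
        lastSignalUtc != default && utcNow - lastSignalUtc < PostSpeechSilence;

    // How long the watchdog would wait before re-checking, clamped to the floor.
    public static TimeSpan RemainingDelay(DateTime lastSignalUtc, DateTime utcNow)
    {
        var remaining = PostSpeechSilence - (utcNow - lastSignalUtc);
        return remaining < MinimumWait ? MinimumWait : remaining;
    }

    // Returns the trimmed hypothesis only when the session had activity and the hypothesis is recent.
    public static string? SelectFallback(
        bool sessionHadActivity, string? hypothesis, DateTime hypothesisUtc, DateTime utcNow)
    {
        if (!sessionHadActivity ||
            string.IsNullOrWhiteSpace(hypothesis) ||
            utcNow - hypothesisUtc > HypothesisWindow)
        {
            return null;
        }

        return hypothesis.Trim();
    }
}
```

Keeping these decisions in static functions that take `utcNow` as a parameter is what lets the tests in this patch pin the 749 ms / 750 ms boundary without sleeping.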
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index cfec33b..8122f73 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -544,6 +544,7 @@ At runtime today:
- `Voice.InputDeviceId` is now used by that `AudioGraph` capture path, but transcript generation still uses the Windows default speech input path until the STT adapter migration is complete
- Talk Mode only advertises `ListeningContinuously` after the capture graph has produced live frames and the recognizer warm-up window has elapsed, so the status acts as a real “you can start talking now” signal instead of a timer-only guess
- recognizer recovery is now speech-triggered rather than silence-triggered: the Windows recognizer is only recycled when sustained capture speech is present but no recognition activity follows
+- when a recognizer session ends after real hypothesis activity but before a final result arrives, Talk Mode now promotes the last recent hypothesis and submits it instead of dropping the utterance
- when the system default capture device changes and Talk Mode is using the default mic, the recognizer is rebuilt so device switches such as AirPods are picked up without a full app restart
- explicit non-default microphone transcript generation is still pending the planned STT adapter migration
@@ -1059,3 +1060,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Fixed `VoiceCaptureService` frame-buffer interop for the current WinRT projection so AudioGraph quantum processing can read frame data instead of throwing invalid-cast warnings on every capture quantum.
- `2026-03-25` Changed Talk Mode readiness so `Listening` only appears after the AudioGraph capture path is delivering frames and the recognizer warm-up completes, making it a user-facing “start talking now” signal rather than a timer-only state.
- `2026-03-25` Changed recognizer recovery so silence no longer churns the Talk Mode pipeline; recycling now only happens when sustained capture speech is present but the Windows recognizer still produces no activity.
+- `2026-03-25` Delayed deaf-recognizer recovery until post-speech silence and added a completion fallback that submits the last recent hypothesis when Windows ends a session without ever producing a final result.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index c88bbf3..90e934b 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -31,6 +31,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private static readonly TimeSpan LateReplyGraceWindow = TimeSpan.FromMinutes(2);
private static readonly TimeSpan InitialRecognitionReadyDelay = TimeSpan.FromMilliseconds(500);
private static readonly TimeSpan RecognitionSpeechMismatchDelay = TimeSpan.FromSeconds(2);
+ private static readonly TimeSpan RecognitionPostSpeechSilenceBeforeRecycle = TimeSpan.FromMilliseconds(750);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
private static readonly TimeSpan HypothesisPromotionWindow = TimeSpan.FromSeconds(2);
private static readonly TimeSpan RecognitionResumeRetryDelay = TimeSpan.FromMilliseconds(500);
@@ -60,6 +61,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private bool _recognitionHealthCheckArmed;
private bool _recognitionRestartInProgress;
private int _recognitionSignalBurstCount;
+ private DateTime _lastCaptureSignalUtc;
private bool _awaitingReply;
private bool _isSpeaking;
private bool _replyPlaybackLoopActive;
@@ -746,6 +748,7 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
generation = ++_recognitionSessionGeneration;
_recognitionSessionHadCaptureSignal = false;
_recognitionSignalBurstCount = 0;
+ _lastCaptureSignalUtc = default;
_recognitionHealthCheckArmed = false;
}
@@ -915,6 +918,7 @@ private async Task StopRecognitionSessionAsync()
_recognitionHealthCheckArmed = false;
_recognitionSessionHadCaptureSignal = false;
_recognitionSignalBurstCount = 0;
+ _lastCaptureSignalUtc = default;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
}
@@ -1030,6 +1034,7 @@ private void OnSpeechHypothesisGenerated(SpeechRecognizer sender, SpeechRecognit
_recognitionSessionHadActivity = true;
_recognitionHealthCheckArmed = false;
_recognitionSignalBurstCount = 0;
+ _lastCaptureSignalUtc = default;
_lastHypothesisText = text;
_lastHypothesisUtc = DateTime.UtcNow;
if (_status.State != VoiceRuntimeState.RecordingUtterance)
@@ -1084,6 +1089,7 @@ private async Task HandleRecognizedTextAsync(string text)
_recognitionHealthCheckArmed = false;
_recognitionSessionHadCaptureSignal = false;
_recognitionSignalBurstCount = 0;
+ _lastCaptureSignalUtc = default;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
cancellationToken = _runtimeCts.Token;
@@ -1563,6 +1569,22 @@ internal static string SelectRecognizedText(
return normalizedHypothesis;
}
+ internal static string? SelectCompletionFallbackText(
+ bool sessionHadActivity,
+ string? latestHypothesisText,
+ DateTime latestHypothesisUtc,
+ DateTime utcNow)
+ {
+ if (!sessionHadActivity ||
+ string.IsNullOrWhiteSpace(latestHypothesisText) ||
+ utcNow - latestHypothesisUtc > HypothesisPromotionWindow)
+ {
+ return null;
+ }
+
+ return latestHypothesisText.Trim();
+ }
+
private static string CreateReplyPreview(string text)
{
var trimmed = text.Trim();
@@ -1636,6 +1658,7 @@ private async void OnSpeechRecognitionCompleted(
var restartInProgress = false;
var sessionHadActivity = false;
var sessionHadCaptureSignal = false;
+ string? fallbackText = null;
lock (_gate)
{
@@ -1647,9 +1670,15 @@ private async void OnSpeechRecognitionCompleted(
_recognitionActive = false;
sessionHadActivity = _recognitionSessionHadActivity;
sessionHadCaptureSignal = _recognitionSessionHadCaptureSignal;
+ fallbackText = SelectCompletionFallbackText(
+ sessionHadActivity,
+ _lastHypothesisText,
+ _lastHypothesisUtc,
+ DateTime.UtcNow);
_recognitionSessionHadActivity = false;
_recognitionSessionHadCaptureSignal = false;
_recognitionSignalBurstCount = 0;
+ _lastCaptureSignalUtc = default;
restartInProgress = _recognitionRestartInProgress;
if (restartInProgress)
{
@@ -1679,6 +1708,17 @@ private async void OnSpeechRecognitionCompleted(
_logger.Warn(
$"Speech recognition session completed with status {args.Status}; restart={shouldRestart}; rebuild={shouldRebuildRecognizer}; hadActivity={sessionHadActivity}; hadCaptureSignal={sessionHadCaptureSignal}");
+ if (!string.IsNullOrWhiteSpace(fallbackText) &&
+ !_awaitingReply &&
+ !_isSpeaking &&
+ !token.IsCancellationRequested)
+ {
+ _logger.Warn(
+ $"Voice recognition completed without a final result; promoting recent hypothesis as fallback transcript: {fallbackText}");
+ await HandleRecognizedTextAsync(fallbackText);
+ return;
+ }
+
if (shouldRestart && !token.IsCancellationRequested)
{
await Task.Delay(250, token);
@@ -1765,6 +1805,7 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_recognitionSessionHadActivity = false;
_recognitionSessionHadCaptureSignal = false;
_recognitionSignalBurstCount = 0;
+ _lastCaptureSignalUtc = default;
_recognitionRestartInProgress = false;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
@@ -2018,6 +2059,12 @@ internal static bool ShouldArmRecognitionRecoveryAfterCaptureSignal(
recognitionSignalBurstCount >= RecognitionSignalBurstThreshold;
}
+ internal static bool ShouldDelayRecognitionRecycleForOngoingSpeech(DateTime lastCaptureSignalUtc, DateTime utcNow)
+ {
+ return lastCaptureSignalUtc != default &&
+ utcNow - lastCaptureSignalUtc < RecognitionPostSpeechSilenceBeforeRecycle;
+ }
+
private void OnCaptureSignalDetected(object? sender, VoiceCaptureSignalEventArgs args)
{
var shouldStartRecoveryWatchdog = false;
@@ -2043,6 +2090,7 @@ private void OnCaptureSignalDetected(object? sender, VoiceCaptureSignalEventArgs
_recognitionSessionHadCaptureSignal = true;
_recognitionSignalBurstCount++;
+ _lastCaptureSignalUtc = DateTime.UtcNow;
if (ShouldArmRecognitionRecoveryAfterCaptureSignal(
_recognitionSessionHadActivity,
@@ -2082,6 +2130,7 @@ private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken
_recognitionSessionHadActivity = false;
_recognitionSessionHadCaptureSignal = false;
_recognitionSignalBurstCount = 0;
+ _lastCaptureSignalUtc = default;
_recognitionHealthCheckArmed = false;
}
@@ -2143,6 +2192,7 @@ private async Task RebuildVoiceCaptureAsync(string reason, CancellationToken can
settings = Clone(_settings.Voice);
_recognitionSessionHadCaptureSignal = false;
_recognitionSignalBurstCount = 0;
+ _lastCaptureSignalUtc = default;
}
if (captureService == null)
@@ -2161,43 +2211,80 @@ private async Task MonitorRecognitionSessionHealthAsync(int generation, Cancella
{
await Task.Delay(RecognitionSpeechMismatchDelay, cancellationToken);
- var shouldRecycle = false;
- var sawCaptureSignal = false;
- var signalBurstCount = 0;
- lock (_gate)
+ bool shouldRecycle;
+ bool sawCaptureSignal;
+ int signalBurstCount;
+ DateTime lastCaptureSignalUtc;
+
+ while (true)
{
- sawCaptureSignal = _recognitionSessionHadCaptureSignal;
- signalBurstCount = _recognitionSignalBurstCount;
- shouldRecycle =
- _recognitionHealthCheckArmed &&
- sawCaptureSignal &&
- _recognitionActive &&
- _runtimeCts != null &&
- !_runtimeCts.IsCancellationRequested &&
- _status.Running &&
- _status.Mode == VoiceActivationMode.TalkMode &&
- !_awaitingReply &&
- !_isSpeaking &&
- generation == _recognitionSessionGeneration;
+ shouldRecycle = false;
+ sawCaptureSignal = false;
+ signalBurstCount = 0;
+ lastCaptureSignalUtc = default;
- if (shouldRecycle)
+ lock (_gate)
{
- _recognitionRestartInProgress = true;
- _status = BuildRunningStatus(
- VoiceActivationMode.TalkMode,
- _status.SessionKey,
- VoiceRuntimeState.Arming,
- "Speech recognizer stalled; restarting listening.");
+ sawCaptureSignal = _recognitionSessionHadCaptureSignal;
+ signalBurstCount = _recognitionSignalBurstCount;
+ lastCaptureSignalUtc = _lastCaptureSignalUtc;
+ shouldRecycle =
+ _recognitionHealthCheckArmed &&
+ sawCaptureSignal &&
+ _recognitionActive &&
+ _runtimeCts != null &&
+ !_runtimeCts.IsCancellationRequested &&
+ _status.Running &&
+ _status.Mode == VoiceActivationMode.TalkMode &&
+ !_awaitingReply &&
+ !_isSpeaking &&
+ generation == _recognitionSessionGeneration;
+ }
+
+ if (!shouldRecycle)
+ {
+ return;
}
+
+ if (!ShouldDelayRecognitionRecycleForOngoingSpeech(lastCaptureSignalUtc, DateTime.UtcNow))
+ {
+ break;
+ }
+
+ var remainingDelay = RecognitionPostSpeechSilenceBeforeRecycle - (DateTime.UtcNow - lastCaptureSignalUtc);
+ if (remainingDelay < TimeSpan.FromMilliseconds(50))
+ {
+ remainingDelay = TimeSpan.FromMilliseconds(50);
+ }
+
+ await Task.Delay(remainingDelay, cancellationToken);
}
- if (!shouldRecycle)
+ lock (_gate)
{
- return;
+ if (!(_recognitionHealthCheckArmed &&
+ _recognitionActive &&
+ _runtimeCts != null &&
+ !_runtimeCts.IsCancellationRequested &&
+ _status.Running &&
+ _status.Mode == VoiceActivationMode.TalkMode &&
+ !_awaitingReply &&
+ !_isSpeaking &&
+ generation == _recognitionSessionGeneration))
+ {
+ return;
+ }
+
+ _recognitionRestartInProgress = true;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.Arming,
+ "Speech recognizer stalled; restarting listening.");
}
_logger.Warn(
- $"Speech recognizer heard sustained capture audio but produced no recognition activity within {RecognitionSpeechMismatchDelay.TotalSeconds:0}s; recycling session (captureSignal={sawCaptureSignal}, signalBursts={signalBurstCount})");
+ $"Speech recognizer heard sustained capture audio but produced no recognition activity within {RecognitionSpeechMismatchDelay.TotalSeconds:0}s and after post-speech silence; recycling session (captureSignal={sawCaptureSignal}, signalBursts={signalBurstCount})");
await StopRecognitionSessionAsync();
await ResumeRecognitionSessionAsync(
cancellationToken,
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index 2b8fa33..b1bc39b 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -218,6 +218,24 @@ public void ShouldArmRecognitionRecoveryAfterCaptureSignal_RequiresBurstWithoutR
Assert.Equal(expected, result);
}
+ [Theory]
+ [InlineData(0, true)]
+ [InlineData(500, true)]
+ [InlineData(749, true)]
+ [InlineData(750, false)]
+ [InlineData(1200, false)]
+ public void ShouldDelayRecognitionRecycleForOngoingSpeech_RequiresShortRecentSignal(int elapsedMs, bool expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "ShouldDelayRecognitionRecycleForOngoingSpeech",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+ var now = new DateTime(2026, 3, 25, 21, 36, 35, DateTimeKind.Utc);
+
+ var result = (bool)method.Invoke(null, [now.AddMilliseconds(-elapsedMs), now])!;
+
+ Assert.Equal(expected, result);
+ }
+
[Theory]
[InlineData(16000, 80, 1280)]
[InlineData(16000, 0, 1280)]
@@ -281,6 +299,29 @@ public void SelectRecognizedText_PromotesRecentLongerHypothesisWhenFinalLooksTru
Assert.Equal(expectedPromoted, (bool)args[4]!);
}
+ [Theory]
+ [InlineData(true, "Now again testing", 1, "Now again testing")]
+ [InlineData(true, "Now again testing", 3, null)]
+ [InlineData(false, "Now again testing", 1, null)]
+ [InlineData(true, "", 1, null)]
+ public void SelectCompletionFallbackText_PromotesRecentHypothesisWhenSessionHadActivity(
+ bool sessionHadActivity,
+ string hypothesis,
+ int hypothesisAgeSeconds,
+ string? expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "SelectCompletionFallbackText",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+ var now = new DateTime(2026, 3, 25, 21, 36, 35, DateTimeKind.Utc);
+
+ var result = (string?)method.Invoke(
+ null,
+ [sessionHadActivity, hypothesis, now.AddSeconds(-hypothesisAgeSeconds), now]);
+
+ Assert.Equal(expected, result);
+ }
+
[Theory]
[InlineData(true, VoiceActivationMode.TalkMode, null, AudioDeviceRole.Default, true)]
[InlineData(true, VoiceActivationMode.TalkMode, "", AudioDeviceRole.Default, true)]
From 4bf802aa43822d4ead89c98bf87eb46809a67bf4 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 21:52:07 +0000
Subject: [PATCH 55/83] Fix overlapping talk mode recovery watchdogs
---
docs/VOICE-MODE.md | 2 ++
src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs | 4 ++--
2 files changed, 4 insertions(+), 2 deletions(-)
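Reviewer note: the single-owner arming that this patch relies on can be illustrated with a small sketch. This is a simplified model under stated assumptions, not the service code: `WatchdogArmSketch`, `OnSpeechBurst`, and `OnRecycle` are hypothetical names, the threshold of 4 bursts is taken from `RecognitionSignalBurstThreshold`, and resetting the burst count inside `OnRecycle` stands in for the reset that the real code performs when the session is stopped.

```csharp
using System;

// Hypothetical model of the arm-once pattern; not part of VoiceService.
public sealed class WatchdogArmSketch
{
    // Assumed equal to RecognitionSignalBurstThreshold in the patch series.
    private const int BurstThreshold = 4;

    private readonly object _gate = new object();
    private bool _armed;
    private bool _hadActivity;
    private int _burstCount;

    // Counts how many watchdog loops would have been launched.
    public int WatchdogsStarted { get; private set; }

    // Called for each speech-like capture burst.
    // Returns true only for the single call that takes ownership of the watchdog.
    public bool OnSpeechBurst()
    {
        lock (_gate)
        {
            _burstCount++;
            if (_hadActivity || _armed || _burstCount < BurstThreshold)
            {
                return false;
            }

            _armed = true;
            WatchdogsStarted++;
            return true;
        }
    }

    // Called when the watchdog commits to recycling; disarming lets a later session arm again.
    public void OnRecycle()
    {
        lock (_gate)
        {
            _armed = false;
            _burstCount = 0; // simplification: the real reset happens when the session stops
        }
    }
}
```

Because the flag is checked and set under the same lock, further bursts while a watchdog is pending cannot start a second loop, which is the overlap this patch removes by dropping the unconditional launch from session start.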
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 8122f73..eb4ec33 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -545,6 +545,7 @@ At runtime today:
- Talk Mode only advertises `ListeningContinuously` after the capture graph has produced live frames and the recognizer warm-up window has elapsed, so the status acts as a real “you can start talking now” signal instead of a timer-only guess
- recognizer recovery is now speech-triggered rather than silence-triggered: the Windows recognizer is only recycled when sustained capture speech is present but no recognition activity follows
- when a recognizer session ends after real hypothesis activity but before a final result arrives, Talk Mode now promotes the last recent hypothesis and submits it instead of dropping the utterance
+- the speech-mismatch recovery watchdog is single-owner and only armed from capture speech, so a new recognition session does not spawn overlapping recovery loops
- when the system default capture device changes and Talk Mode is using the default mic, the recognizer is rebuilt so device switches such as AirPods are picked up without a full app restart
- explicit non-default microphone transcript generation is still pending the planned STT adapter migration
@@ -1061,3 +1062,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Changed Talk Mode readiness so `Listening` only appears after the AudioGraph capture path is delivering frames and the recognizer warm-up completes, making it a user-facing “start talking now” signal rather than a timer-only state.
- `2026-03-25` Changed recognizer recovery so silence no longer churns the Talk Mode pipeline; recycling now only happens when sustained capture speech is present but the Windows recognizer still produces no activity.
- `2026-03-25` Delayed deaf-recognizer recovery until post-speech silence and added a completion fallback that submits the last recent hypothesis when Windows ends a session without ever producing a final result.
+- `2026-03-25` Fixed overlapping Talk Mode recovery watchdogs so a new recognition session no longer launches duplicate deaf-recognizer recycle loops.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 90e934b..feb45b5 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -30,7 +30,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
private static readonly TimeSpan LateReplyGraceWindow = TimeSpan.FromMinutes(2);
private static readonly TimeSpan InitialRecognitionReadyDelay = TimeSpan.FromMilliseconds(500);
- private static readonly TimeSpan RecognitionSpeechMismatchDelay = TimeSpan.FromSeconds(2);
+ private static readonly TimeSpan RecognitionSpeechMismatchDelay = TimeSpan.FromSeconds(4);
private static readonly TimeSpan RecognitionPostSpeechSilenceBeforeRecycle = TimeSpan.FromMilliseconds(750);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
private static readonly TimeSpan HypothesisPromotionWindow = TimeSpan.FromSeconds(2);
@@ -777,7 +777,6 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
_ = MonitorListeningReadyAsync(generation, runtimeToken);
}
- _ = MonitorRecognitionSessionHealthAsync(generation, runtimeToken);
}
private async Task MonitorListeningReadyAsync(int generation, CancellationToken cancellationToken)
@@ -2275,6 +2274,7 @@ private async Task MonitorRecognitionSessionHealthAsync(int generation, Cancella
return;
}
+ _recognitionHealthCheckArmed = false;
_recognitionRestartInProgress = true;
_status = BuildRunningStatus(
VoiceActivationMode.TalkMode,
From 93f01dd6b3d1acfbbb6adeb7652fed8d655d653f Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 22:02:47 +0000
Subject: [PATCH 56/83] Fix talk mode playback failure handling
---
docs/VOICE-MODE.md | 1 +
.../Services/Voice/VoiceService.cs | 22 +++++++++++++++++--
2 files changed, 21 insertions(+), 2 deletions(-)
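Reviewer note: the failure routing in this patch sends the error to whichever phase is still pending, so only one task is faulted and it is the one being awaited. A minimal sketch of that routing follows. `PlaybackFailureSketch` and `SimulateFailure` are hypothetical names and no media player is involved; only the two `TaskCompletionSource` instances and the default message string come from the patch.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-alone model of the failed-handler routing; no MediaPlayer is used here.
public static class PlaybackFailureSketch
{
    // Simulates a MediaFailed callback arriving either before or after the media opened.
    public static (Task Opened, Task Ended) SimulateFailure(bool failAfterOpen, string? errorMessage)
    {
        var mediaOpened = new TaskCompletionSource<bool>();
        var playbackEnded = new TaskCompletionSource<bool>();

        if (failAfterOpen)
        {
            mediaOpened.TrySetResult(true);
        }

        // Fall back to a generic message when the platform reports none.
        var message = string.IsNullOrWhiteSpace(errorMessage)
            ? "Media playback failed."
            : errorMessage;
        var exception = new InvalidOperationException(message);

        if (!mediaOpened.Task.IsCompleted)
        {
            // Open phase still pending: fault only the task the caller is currently awaiting.
            mediaOpened.TrySetException(exception);
        }
        else
        {
            // Media already opened: the failure belongs to the playback phase.
            playbackEnded.TrySetException(exception);
        }

        return (mediaOpened.Task, playbackEnded.Task);
    }
}
```

Before this change both sources were faulted on every failure, so the one nobody awaited could surface later as an unobserved task exception.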
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index eb4ec33..dae7253 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -1063,3 +1063,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Changed recognizer recovery so silence no longer churns the Talk Mode pipeline; recycling now only happens when sustained capture speech is present but the Windows recognizer still produces no activity.
- `2026-03-25` Delayed deaf-recognizer recovery until post-speech silence and added a completion fallback that submits the last recent hypothesis when Windows ends a session without ever producing a final result.
- `2026-03-25` Fixed overlapping Talk Mode recovery watchdogs so a new recognition session no longer launches duplicate deaf-recognizer recycle loops.
+- `2026-03-25` Fixed Talk Mode media playback failure handling so a failed reply no longer leaks an unobserved task exception after the reply audio arrives.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index feb45b5..02908d2 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -1613,8 +1613,17 @@ private static async Task PlayStreamAsync(
endedHandler = (sender, _) => playbackEnded.TrySetResult(true);
failedHandler = (sender, args) =>
{
- var exception = new InvalidOperationException(args.ErrorMessage);
- mediaOpened.TrySetException(exception);
+ var errorMessage = string.IsNullOrWhiteSpace(args.ErrorMessage)
+ ? "Media playback failed."
+ : args.ErrorMessage;
+ var exception = new InvalidOperationException(errorMessage);
+
+ if (!mediaOpened.Task.IsCompleted)
+ {
+ mediaOpened.TrySetException(exception);
+ return;
+ }
+
playbackEnded.TrySetException(exception);
};
@@ -1636,6 +1645,15 @@ private static async Task PlayStreamAsync(
player.Play();
await playbackEnded.Task;
}
+ catch
+ {
+ if (playbackEnded.Task.IsFaulted)
+ {
+ _ = playbackEnded.Task.Exception;
+ }
+
+ throw;
+ }
finally
{
player.MediaOpened -= openedHandler;
From cd013ac0994537031f0de12d07229a1c13bbfb7c Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 22:30:42 +0000
Subject: [PATCH 57/83] Add ElevenLabs websocket TTS provider
---
docs/VOICE-MODE.md | 31 ++++++++++++-------
.../Assets/voice-providers.json | 25 +++++++++------
.../Voice/VoiceCloudTextToSpeechClient.cs | 16 +++++++---
.../VoiceProviderCatalogServiceTests.cs | 19 ++++++++----
4 files changed, 60 insertions(+), 31 deletions(-)
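Reviewer note: the catalog entry in this patch declares `responseAudioMode` as `base64JsonString`, with the audio at JSON path `audio` and the final flag at `isFinal`. The sketch below shows how one such response frame could be decoded. It is an assumption-laden illustration, not the `VoiceCloudTextToSpeechClient` implementation: `StreamInputFrameSketch` and `ParseFrame` are hypothetical names, and the frame shape is inferred only from those catalog fields.

```csharp
using System;
using System.Text.Json;

// Hypothetical decoder for a single response frame; field names come from the catalog entry.
public static class StreamInputFrameSketch
{
    // Returns the decoded audio bytes (empty when the frame carries none)
    // and whether the frame is flagged as final.
    public static (byte[] Audio, bool IsFinal) ParseFrame(string json)
    {
        using var document = JsonDocument.Parse(json);
        var root = document.RootElement;

        var audio = Array.Empty<byte>();
        if (root.TryGetProperty("audio", out var audioElement) &&
            audioElement.ValueKind == JsonValueKind.String)
        {
            // responseAudioMode "base64JsonString": the audio chunk is a base64 string.
            audio = Convert.FromBase64String(audioElement.GetString()!);
        }

        // finalFlagJsonPath "isFinal": treated as set only when it is the JSON literal true.
        var isFinal = root.TryGetProperty("isFinal", out var finalElement) &&
                      finalElement.ValueKind == JsonValueKind.True;

        return (audio, isFinal);
    }
}
```

Since the player still waits for a complete playable stream, a caller would presumably append each decoded chunk to one buffer and stop reading once a frame reports `isFinal`.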
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index dae7253..ef12bed 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -155,7 +155,7 @@ The main remaining gap is streaming playback from the first audio chunk. The Azu
- Windows `SpeechSynthesizer` is used through `SynthesizeTextToStreamAsync`, which returns a complete stream for playback
- MiniMax now uses the provider catalog's WebSocket TTS contract, but the current player still waits for a complete playable stream before output starts
-- ElevenLabs is currently integrated through the non-streaming convert contract in the provider catalog
+- ElevenLabs now uses the provider catalog's `stream-input` WebSocket contract, but the current player still waits for a complete playable stream before output starts
So the current design minimizes avoidable setup and connection latency, but does not yet implement first-chunk playback streaming. This is, however, planned for an early release.
@@ -316,7 +316,7 @@ Example:
"name": "ElevenLabs",
"runtime": "cloud",
"enabled": true,
- "description": "Cloud TTS using the ElevenLabs create speech API.",
+ "description": "Cloud TTS using the ElevenLabs WebSocket stream-input API.",
"settings": [
{ "key": "apiKey", "label": "API key", "secret": true },
{
@@ -330,22 +330,28 @@ Example:
"eleven_monolingual_v1"
]
},
- { "key": "voiceId", "label": "Voice ID", "placeholder": "Enter an ElevenLabs voice ID" },
+ { "key": "voiceId", "label": "Voice ID", "defaultValue": "6aDn1KB0hjpdcocrUkmq", "placeholder": "Enter an ElevenLabs voice ID" },
{
"key": "voiceSettingsJson",
"label": "Voice settings JSON",
- "defaultValue": "\"voice_settings\": null",
- "placeholder": "\"voice_settings\": { \"stability\": 0.5, \"similarity_boost\": 0.8 }"
+ "defaultValue": "\"voice_settings\": { \"speed\": 0.9, \"stability\": 0.5, \"similarity_boost\": 0.75 }",
+ "placeholder": "\"voice_settings\": { \"speed\": 0.9, \"stability\": 0.5, \"similarity_boost\": 0.75 }"
}
],
- "textToSpeechHttp": {
- "endpointTemplate": "https://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}?output_format=mp3_44100_128",
- "httpMethod": "POST",
+ "textToSpeechWebSocket": {
+ "endpointTemplate": "wss://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}/stream-input?model_id={{model}}&output_format=mp3_44100_128&auto_mode=true",
"authenticationHeaderName": "xi-api-key",
+ "authenticationScheme": "",
"apiKeySettingKey": "apiKey",
- "requestContentType": "application/json",
- "requestBodyTemplate": "{ \"text\": {{text}}, \"model_id\": {{model}}, {{voiceSettingsJson}} }",
- "responseAudioMode": "binary",
+ "connectSuccessEventName": "",
+ "startMessageTemplate": "{ \"text\": \" \", {{voiceSettingsJson}} }",
+ "startSuccessEventName": "",
+ "continueMessageTemplate": "{ \"text\": {{text}} }",
+ "finishMessageTemplate": "{ \"text\": \"\" }",
+ "responseAudioMode": "base64JsonString",
+ "responseAudioJsonPath": "audio",
+ "finalFlagJsonPath": "isFinal",
+ "taskFailedEventName": "error",
"outputContentType": "audio/mpeg"
}
}
@@ -394,7 +400,7 @@ without hard-coding provider-specific wrapper keys into the runtime.
The current cloud TTS transports are:
- `MiniMax`: catalog-driven WebSocket synthesis
-- `ElevenLabs`: catalog-driven HTTP synthesis
+- `ElevenLabs`: catalog-driven WebSocket synthesis (`stream-input`)
For `VoiceWake`, trigger words are gateway-owned global state. The Windows node should eventually consume the same shared trigger list and keep only a local enabled/disabled toggle plus device/runtime settings.
@@ -1064,3 +1070,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Delayed deaf-recognizer recovery until post-speech silence and added a completion fallback that submits the last recent hypothesis when Windows ends a session without ever producing a final result.
- `2026-03-25` Fixed overlapping Talk Mode recovery watchdogs so a new recognition session no longer launches duplicate deaf-recognizer recycle loops.
- `2026-03-25` Fixed Talk Mode media playback failure handling so a failed reply no longer leaks an unobserved task exception after the reply audio arrives.
+- `2026-03-25` Generalized the catalog-driven WebSocket TTS client to support providers without explicit connect/start acknowledgements and switched ElevenLabs to the `stream-input` WebSocket API with default voice settings.
diff --git a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
index c0b86d4..fb0e090 100644
--- a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
+++ b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
@@ -83,7 +83,7 @@
"name": "ElevenLabs",
"runtime": "cloud",
"enabled": true,
- "description": "Cloud TTS using the ElevenLabs create speech API.",
+ "description": "Cloud TTS using the ElevenLabs WebSocket stream-input API.",
"settings": [
{
"key": "apiKey",
@@ -105,6 +105,7 @@
"key": "voiceId",
"label": "Voice ID",
"required": false,
+ "defaultValue": "6aDn1KB0hjpdcocrUkmq",
"placeholder": "Enter an ElevenLabs voice ID"
},
{
@@ -112,19 +113,25 @@
"label": "Voice settings JSON",
"required": false,
"jsonValue": true,
- "defaultValue": "\"voice_settings\": null",
- "placeholder": "\"voice_settings\": { \"stability\": 0.5, \"similarity_boost\": 0.8 }",
+ "defaultValue": "\"voice_settings\": { \"speed\": 0.9, \"stability\": 0.5, \"similarity_boost\": 0.75 }",
+ "placeholder": "\"voice_settings\": { \"speed\": 0.9, \"stability\": 0.5, \"similarity_boost\": 0.75 }",
"description": "Optional full ElevenLabs request fragment. If present, it controls the full voice_settings payload."
}
],
- "textToSpeechHttp": {
- "endpointTemplate": "https://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}?output_format=mp3_44100_128",
- "httpMethod": "POST",
+ "textToSpeechWebSocket": {
+ "endpointTemplate": "wss://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}/stream-input?model_id={{model}}&output_format=mp3_44100_128&auto_mode=true",
"authenticationHeaderName": "xi-api-key",
+ "authenticationScheme": "",
"apiKeySettingKey": "apiKey",
- "requestContentType": "application/json",
- "requestBodyTemplate": "{ \"text\": {{text}}, \"model_id\": {{model}}, {{voiceSettingsJson}} }",
- "responseAudioMode": "binary",
+ "connectSuccessEventName": "",
+ "startMessageTemplate": "{ \"text\": \" \", {{voiceSettingsJson}} }",
+ "startSuccessEventName": "",
+ "continueMessageTemplate": "{ \"text\": {{text}} }",
+ "finishMessageTemplate": "{ \"text\": \"\" }",
+ "responseAudioMode": "base64JsonString",
+ "responseAudioJsonPath": "audio",
+ "finalFlagJsonPath": "isFinal",
+ "taskFailedEventName": "error",
"outputContentType": "audio/mpeg"
}
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
index ad36486..da452c2 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
@@ -95,13 +95,21 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
var stopwatch = Stopwatch.StartNew();
await socket.ConnectAsync(new Uri(endpoint), CancellationToken.None);
- var connectedMessage = await ReceiveJsonMessageAsync(socket);
- ValidateWebSocketEvent(provider.Name, contract.ConnectSuccessEventName, connectedMessage, contract);
+
+ if (!string.IsNullOrWhiteSpace(contract.ConnectSuccessEventName))
+ {
+ var connectedMessage = await ReceiveJsonMessageAsync(socket);
+ ValidateWebSocketEvent(provider.Name, contract.ConnectSuccessEventName, connectedMessage, contract);
+ }
var startMessage = ApplyJsonTemplate(contract.StartMessageTemplate, templateValues);
await SendTextMessageAsync(socket, startMessage);
- var startedMessage = await ReceiveJsonMessageAsync(socket);
- ValidateWebSocketEvent(provider.Name, contract.StartSuccessEventName, startedMessage, contract);
+
+ if (!string.IsNullOrWhiteSpace(contract.StartSuccessEventName))
+ {
+ var startedMessage = await ReceiveJsonMessageAsync(socket);
+ ValidateWebSocketEvent(provider.Name, contract.StartSuccessEventName, startedMessage, contract);
+ }
var continueMessage = ApplyJsonTemplate(contract.ContinueMessageTemplate, templateValues);
await SendTextMessageAsync(socket, continueMessage);
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index 192ac01..60e9aee 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -78,19 +78,26 @@ public void LoadCatalog_ExposesBuiltInCloudTtsContracts()
var elevenLabs = Assert.Single(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.ElevenLabs);
Assert.Equal("ElevenLabs", elevenLabs.Name);
- Assert.NotNull(elevenLabs.TextToSpeechHttp);
+ Assert.NotNull(elevenLabs.TextToSpeechWebSocket);
Assert.Equal(
- "https://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}?output_format=mp3_44100_128",
- elevenLabs.TextToSpeechHttp!.EndpointTemplate);
- Assert.Equal("xi-api-key", elevenLabs.TextToSpeechHttp.AuthenticationHeaderName);
- Assert.Equal(VoiceTextToSpeechResponseModes.Binary, elevenLabs.TextToSpeechHttp.ResponseAudioMode);
+ "wss://api.elevenlabs.io/v1/text-to-speech/{{voiceId}}/stream-input?model_id={{model}}&output_format=mp3_44100_128&auto_mode=true",
+ elevenLabs.TextToSpeechWebSocket!.EndpointTemplate);
+ Assert.Equal("xi-api-key", elevenLabs.TextToSpeechWebSocket.AuthenticationHeaderName);
+ Assert.Equal(string.Empty, elevenLabs.TextToSpeechWebSocket.AuthenticationScheme);
+ Assert.Equal(string.Empty, elevenLabs.TextToSpeechWebSocket.ConnectSuccessEventName);
+ Assert.Equal(string.Empty, elevenLabs.TextToSpeechWebSocket.StartSuccessEventName);
+ Assert.Equal(VoiceTextToSpeechResponseModes.Base64JsonString, elevenLabs.TextToSpeechWebSocket.ResponseAudioMode);
+ Assert.Equal("audio", elevenLabs.TextToSpeechWebSocket.ResponseAudioJsonPath);
+ Assert.Equal("isFinal", elevenLabs.TextToSpeechWebSocket.FinalFlagJsonPath);
var elevenLabsModelSetting = elevenLabs.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Model);
Assert.Equal("eleven_multilingual_v2", elevenLabsModelSetting.DefaultValue);
Assert.Contains("eleven_flash_v2_5", elevenLabsModelSetting.Options);
Assert.Contains("eleven_turbo_v2_5", elevenLabsModelSetting.Options);
+ Assert.Equal("6aDn1KB0hjpdcocrUkmq", elevenLabs.Settings.Single(s => s.Key == VoiceProviderSettingKeys.VoiceId).DefaultValue);
var elevenLabsVoiceSettingsJson = elevenLabs.Settings.Single(s => s.Key == VoiceProviderSettingKeys.VoiceSettingsJson);
Assert.False(elevenLabsVoiceSettingsJson.Required);
Assert.True(elevenLabsVoiceSettingsJson.JsonValue);
- Assert.Equal("\"voice_settings\": null", elevenLabsVoiceSettingsJson.DefaultValue);
+ Assert.Contains("\"voice_settings\":", elevenLabsVoiceSettingsJson.DefaultValue);
+ Assert.Contains("\"speed\": 0.9", elevenLabsVoiceSettingsJson.DefaultValue);
}
}
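Review note on PATCH 57: the new contract sets `responseAudioMode` to `base64JsonString`, with `responseAudioJsonPath` = `audio` and `finalFlagJsonPath` = `isFinal`. A minimal sketch of what consuming such frames could look like is below. `StreamInputFrameSketch` and `TryAppendFrame` are hypothetical names, not the patch's `TryGetJsonString` helper. The sketch only resolves top-level properties, so a dotted path would need extra handling, and the sample frames in the usage note are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of the patch: accumulates base64 audio chunks
// from JSON frames and reports whether a frame carries a true final flag.
public static class StreamInputFrameSketch
{
    // Returns true when the frame marks the end of the stream.
    public static bool TryAppendFrame(
        string frameJson,
        string audioPath,      // e.g. "audio" (responseAudioJsonPath)
        string finalFlagPath,  // e.g. "isFinal" (finalFlagJsonPath)
        List<byte> audioBytes)
    {
        using var document = JsonDocument.Parse(frameJson);
        var root = document.RootElement;

        // The audio property may be absent or null on frames without audio.
        if (root.TryGetProperty(audioPath, out var audio) &&
            audio.ValueKind == JsonValueKind.String)
        {
            var chunk = audio.GetString();
            if (!string.IsNullOrEmpty(chunk))
            {
                audioBytes.AddRange(Convert.FromBase64String(chunk));
            }
        }

        // Only a literal JSON true ends the stream; null or false does not.
        return root.TryGetProperty(finalFlagPath, out var isFinal) &&
               isFinal.ValueKind == JsonValueKind.True;
    }
}
```

Feeding it `{"audio":"AQID","isFinal":null}` appends three bytes and returns false; a following `{"audio":null,"isFinal":true}` appends nothing and returns true.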
From 5dadf15f3104cb2553444442edb44d19206bf56a Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 22:49:41 +0000
Subject: [PATCH 58/83] Adjust voice tray icon states
---
docs/VOICE-MODE.md | 1 +
src/OpenClaw.Tray.WinUI/Helpers/IconHelper.cs | 26 +++++++------------
2 files changed, 10 insertions(+), 17 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index ef12bed..56030d7 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -1071,3 +1071,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Fixed overlapping Talk Mode recovery watchdogs so a new recognition session no longer launches duplicate deaf-recognizer recycle loops.
- `2026-03-25` Fixed Talk Mode media playback failure handling so a failed reply no longer leaks an unobserved task exception after the reply audio arrives.
- `2026-03-25` Generalized the catalog-driven WebSocket TTS client to support providers without explicit connect/start acknowledgements and switched ElevenLabs to the `stream-input` WebSocket API with default voice settings.
+- `2026-03-25` Reworked the dynamic tray icon language so listening shows activity waves around the headphones and speaking uses a microphone badge instead of a speaker icon.
diff --git a/src/OpenClaw.Tray.WinUI/Helpers/IconHelper.cs b/src/OpenClaw.Tray.WinUI/Helpers/IconHelper.cs
index 71a28ab..677ab93 100644
--- a/src/OpenClaw.Tray.WinUI/Helpers/IconHelper.cs
+++ b/src/OpenClaw.Tray.WinUI/Helpers/IconHelper.cs
@@ -219,11 +219,10 @@ private static Bitmap CreateVoiceTrayBitmap(VoiceTrayIconState state)
break;
case VoiceTrayIconState.Listening:
DrawHeadphones(graphics);
- DrawMicrophone(graphics);
+ DrawHeadphoneWaves(graphics);
break;
case VoiceTrayIconState.Speaking:
- DrawHeadphones(graphics);
- DrawSpeaker(graphics);
+ DrawMicrophone(graphics);
break;
}
@@ -253,23 +252,16 @@ private static void DrawMicrophone(Graphics graphics)
graphics.DrawLine(pen, 20, 21, 15, 19);
}
- private static void DrawSpeaker(Graphics graphics)
+ private static void DrawHeadphoneWaves(Graphics graphics)
{
- using var brush = new SolidBrush(Color.FromArgb(76, 175, 80));
- using var pen = new Pen(Color.FromArgb(76, 175, 80), 2f);
- using var thinPen = new Pen(Color.FromArgb(76, 175, 80), 1.5f);
+ using var wavePen = new Pen(Color.FromArgb(76, 175, 80), 2f);
+ using var accentPen = new Pen(Color.FromArgb(76, 175, 80), 1.5f);
- var points = new[]
- {
- new Point(24, 17),
- new Point(19, 20),
- new Point(19, 24),
- new Point(24, 27)
- };
+ graphics.DrawArc(wavePen, 0, 12, 8, 8, 270, 180);
+ graphics.DrawArc(accentPen, 2, 14, 4, 4, 270, 180);
- graphics.FillPolygon(brush, points);
- graphics.DrawArc(pen, 22, 17, 6, 10, 300, 120);
- graphics.DrawArc(thinPen, 21, 14, 10, 16, 300, 120);
+ graphics.DrawArc(wavePen, 24, 12, 8, 8, 90, 180);
+ graphics.DrawArc(accentPen, 26, 14, 4, 4, 90, 180);
}
private static Icon CreateIcon(Bitmap bitmap)
From e22474bd583b40585c209c1957257c36fd01c576 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 23:06:25 +0000
Subject: [PATCH 59/83] Tune ElevenLabs websocket turn flushing
---
docs/VOICE-MODE.md | 5 +++--
src/OpenClaw.Tray.WinUI/Assets/voice-providers.json | 4 ++--
.../OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs | 2 ++
3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 56030d7..8445a55 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -344,9 +344,9 @@ Example:
"authenticationScheme": "",
"apiKeySettingKey": "apiKey",
"connectSuccessEventName": "",
- "startMessageTemplate": "{ \"text\": \" \", {{voiceSettingsJson}} }",
+ "startMessageTemplate": "{ \"text\": \" \", {{voiceSettingsJson}}, \"xi_api_key\": {{apiKey}} }",
"startSuccessEventName": "",
- "continueMessageTemplate": "{ \"text\": {{text}} }",
+ "continueMessageTemplate": "{ \"text\": {{text}}, \"flush\": true }",
"finishMessageTemplate": "{ \"text\": \"\" }",
"responseAudioMode": "base64JsonString",
"responseAudioJsonPath": "audio",
@@ -1072,3 +1072,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Fixed Talk Mode media playback failure handling so a failed reply no longer leaks an unobserved task exception after the reply audio arrives.
- `2026-03-25` Generalized the catalog-driven WebSocket TTS client to support providers without explicit connect/start acknowledgements and switched ElevenLabs to the `stream-input` WebSocket API with default voice settings.
- `2026-03-25` Reworked the dynamic tray icon language so listening shows activity waves around the headphones and speaking uses a microphone badge instead of a speaker icon.
+- `2026-03-25` Adjusted the ElevenLabs `stream-input` contract to send `xi_api_key` in the init message and `flush: true` with the text turn so short replies are emitted promptly instead of being buffered then closed.
diff --git a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
index fb0e090..456858e 100644
--- a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
+++ b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
@@ -124,9 +124,9 @@
"authenticationScheme": "",
"apiKeySettingKey": "apiKey",
"connectSuccessEventName": "",
- "startMessageTemplate": "{ \"text\": \" \", {{voiceSettingsJson}} }",
+ "startMessageTemplate": "{ \"text\": \" \", {{voiceSettingsJson}}, \"xi_api_key\": {{apiKey}} }",
"startSuccessEventName": "",
- "continueMessageTemplate": "{ \"text\": {{text}} }",
+ "continueMessageTemplate": "{ \"text\": {{text}}, \"flush\": true }",
"finishMessageTemplate": "{ \"text\": \"\" }",
"responseAudioMode": "base64JsonString",
"responseAudioJsonPath": "audio",
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index 60e9aee..8b7c1b0 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -86,6 +86,8 @@ public void LoadCatalog_ExposesBuiltInCloudTtsContracts()
Assert.Equal(string.Empty, elevenLabs.TextToSpeechWebSocket.AuthenticationScheme);
Assert.Equal(string.Empty, elevenLabs.TextToSpeechWebSocket.ConnectSuccessEventName);
Assert.Equal(string.Empty, elevenLabs.TextToSpeechWebSocket.StartSuccessEventName);
+ Assert.Contains("\"xi_api_key\": {{apiKey}}", elevenLabs.TextToSpeechWebSocket.StartMessageTemplate);
+ Assert.Contains("\"flush\": true", elevenLabs.TextToSpeechWebSocket.ContinueMessageTemplate);
Assert.Equal(VoiceTextToSpeechResponseModes.Base64JsonString, elevenLabs.TextToSpeechWebSocket.ResponseAudioMode);
Assert.Equal("audio", elevenLabs.TextToSpeechWebSocket.ResponseAudioJsonPath);
Assert.Equal("isFinal", elevenLabs.TextToSpeechWebSocket.FinalFlagJsonPath);
From 39623a66c76ebb7e291ef7eb83a0796d621080f3 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 23:12:05 +0000
Subject: [PATCH 60/83] Stop premature ElevenLabs websocket EOS
---
docs/VOICE-MODE.md | 3 ++-
src/OpenClaw.Tray.WinUI/Assets/voice-providers.json | 2 +-
tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs | 1 +
3 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 8445a55..8a08851 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -347,7 +347,7 @@ Example:
"startMessageTemplate": "{ \"text\": \" \", {{voiceSettingsJson}}, \"xi_api_key\": {{apiKey}} }",
"startSuccessEventName": "",
"continueMessageTemplate": "{ \"text\": {{text}}, \"flush\": true }",
- "finishMessageTemplate": "{ \"text\": \"\" }",
+ "finishMessageTemplate": "",
"responseAudioMode": "base64JsonString",
"responseAudioJsonPath": "audio",
"finalFlagJsonPath": "isFinal",
@@ -1073,3 +1073,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Generalized the catalog-driven WebSocket TTS client to support providers without explicit connect/start acknowledgements and switched ElevenLabs to the `stream-input` WebSocket API with default voice settings.
- `2026-03-25` Reworked the dynamic tray icon language so listening shows activity waves around the headphones and speaking uses a microphone badge instead of a speaker icon.
- `2026-03-25` Adjusted the ElevenLabs `stream-input` contract to send `xi_api_key` in the init message and `flush: true` with the text turn so short replies are emitted promptly instead of being buffered then closed.
+- `2026-03-25` Stopped sending an immediate ElevenLabs EOS message before reading audio, so single-turn `stream-input` replies rely on `flush: true` instead of prematurely closing the socket.
diff --git a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
index 456858e..1b7d8da 100644
--- a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
+++ b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
@@ -127,7 +127,7 @@
"startMessageTemplate": "{ \"text\": \" \", {{voiceSettingsJson}}, \"xi_api_key\": {{apiKey}} }",
"startSuccessEventName": "",
"continueMessageTemplate": "{ \"text\": {{text}}, \"flush\": true }",
- "finishMessageTemplate": "{ \"text\": \"\" }",
+ "finishMessageTemplate": "",
"responseAudioMode": "base64JsonString",
"responseAudioJsonPath": "audio",
"finalFlagJsonPath": "isFinal",
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index 8b7c1b0..f590c11 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -88,6 +88,7 @@ public void LoadCatalog_ExposesBuiltInCloudTtsContracts()
Assert.Equal(string.Empty, elevenLabs.TextToSpeechWebSocket.StartSuccessEventName);
Assert.Contains("\"xi_api_key\": {{apiKey}}", elevenLabs.TextToSpeechWebSocket.StartMessageTemplate);
Assert.Contains("\"flush\": true", elevenLabs.TextToSpeechWebSocket.ContinueMessageTemplate);
+ Assert.Equal(string.Empty, elevenLabs.TextToSpeechWebSocket.FinishMessageTemplate);
Assert.Equal(VoiceTextToSpeechResponseModes.Base64JsonString, elevenLabs.TextToSpeechWebSocket.ResponseAudioMode);
Assert.Equal("audio", elevenLabs.TextToSpeechWebSocket.ResponseAudioJsonPath);
Assert.Equal("isFinal", elevenLabs.TextToSpeechWebSocket.FinalFlagJsonPath);
From 8499202a8b5f614a2a457e53ff0095bf0e3ff073 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Wed, 25 Mar 2026 23:19:32 +0000
Subject: [PATCH 61/83] Match ElevenLabs websocket generation flow
---
docs/VOICE-MODE.md | 5 +++--
src/OpenClaw.Tray.WinUI/Assets/voice-providers.json | 4 ++--
.../Services/Voice/VoiceCloudTextToSpeechClient.cs | 13 +++++++++++--
.../VoiceProviderCatalogServiceTests.cs | 5 +++--
4 files changed, 19 insertions(+), 8 deletions(-)
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 8a08851..c087286 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -346,8 +346,8 @@ Example:
"connectSuccessEventName": "",
"startMessageTemplate": "{ \"text\": \" \", {{voiceSettingsJson}}, \"xi_api_key\": {{apiKey}} }",
"startSuccessEventName": "",
- "continueMessageTemplate": "{ \"text\": {{text}}, \"flush\": true }",
- "finishMessageTemplate": "",
+ "continueMessageTemplate": "{ \"text\": {{textWithTrailingSpace}}, \"try_trigger_generation\": true }",
+ "finishMessageTemplate": "{ \"text\": \"\" }",
"responseAudioMode": "base64JsonString",
"responseAudioJsonPath": "audio",
"finalFlagJsonPath": "isFinal",
@@ -1074,3 +1074,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Reworked the dynamic tray icon language so listening shows activity waves around the headphones and speaking uses a microphone badge instead of a speaker icon.
- `2026-03-25` Adjusted the ElevenLabs `stream-input` contract to send `xi_api_key` in the init message and `flush: true` with the text turn so short replies are emitted promptly instead of being buffered then closed.
- `2026-03-25` Stopped sending an immediate ElevenLabs EOS message before reading audio, so single-turn `stream-input` replies rely on `flush: true` instead of prematurely closing the socket.
+- `2026-03-25` Switched ElevenLabs `stream-input` to the published text-generation pattern: send the text turn with a trailing-space variant plus `try_trigger_generation: true`, then send EOS, and include the WebSocket close status in playback errors.
diff --git a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
index 1b7d8da..996bfce 100644
--- a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
+++ b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
@@ -126,8 +126,8 @@
"connectSuccessEventName": "",
"startMessageTemplate": "{ \"text\": \" \", {{voiceSettingsJson}}, \"xi_api_key\": {{apiKey}} }",
"startSuccessEventName": "",
- "continueMessageTemplate": "{ \"text\": {{text}}, \"flush\": true }",
- "finishMessageTemplate": "",
+ "continueMessageTemplate": "{ \"text\": {{textWithTrailingSpace}}, \"try_trigger_generation\": true }",
+ "finishMessageTemplate": "{ \"text\": \"\" }",
"responseAudioMode": "base64JsonString",
"responseAudioJsonPath": "audio",
"finalFlagJsonPath": "isFinal",
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
index da452c2..52cca0a 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
@@ -179,7 +179,9 @@ private static Dictionary<string, TemplateValue> BuildTemplateValues(
{
var values = new Dictionary<string, TemplateValue>(StringComparer.OrdinalIgnoreCase)
{
- ["text"] = TemplateValue.FromString(text)
+ ["text"] = TemplateValue.FromString(text),
+ ["textWithTrailingSpace"] = TemplateValue.FromString(
+ text.EndsWith(' ') ? text : text + " ")
};
foreach (var setting in provider.Settings)
@@ -496,7 +498,14 @@ private static async Task<JsonElement> ReceiveJsonMessageAsync(ClientWebSocket s
if (result.MessageType == WebSocketMessageType.Close)
{
- throw new InvalidOperationException("Voice provider closed the WebSocket unexpectedly.");
+ var closeStatus = socket.CloseStatus?.ToString() ?? "Unknown";
+ var closeDescription = string.IsNullOrWhiteSpace(socket.CloseStatusDescription)
+ ? null
+ : socket.CloseStatusDescription;
+ throw new InvalidOperationException(
+ string.IsNullOrWhiteSpace(closeDescription)
+ ? $"Voice provider closed the WebSocket unexpectedly ({closeStatus})."
+ : $"Voice provider closed the WebSocket unexpectedly ({closeStatus}: {closeDescription}).");
}
buffer.Write(receiveBuffer, 0, result.Count);
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index f590c11..34d5245 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -87,8 +87,9 @@ public void LoadCatalog_ExposesBuiltInCloudTtsContracts()
Assert.Equal(string.Empty, elevenLabs.TextToSpeechWebSocket.ConnectSuccessEventName);
Assert.Equal(string.Empty, elevenLabs.TextToSpeechWebSocket.StartSuccessEventName);
Assert.Contains("\"xi_api_key\": {{apiKey}}", elevenLabs.TextToSpeechWebSocket.StartMessageTemplate);
- Assert.Contains("\"flush\": true", elevenLabs.TextToSpeechWebSocket.ContinueMessageTemplate);
- Assert.Equal(string.Empty, elevenLabs.TextToSpeechWebSocket.FinishMessageTemplate);
+ Assert.Contains("\"try_trigger_generation\": true", elevenLabs.TextToSpeechWebSocket.ContinueMessageTemplate);
+ Assert.Contains("{{textWithTrailingSpace}}", elevenLabs.TextToSpeechWebSocket.ContinueMessageTemplate);
+ Assert.Equal("{ \"text\": \"\" }", elevenLabs.TextToSpeechWebSocket.FinishMessageTemplate);
Assert.Equal(VoiceTextToSpeechResponseModes.Base64JsonString, elevenLabs.TextToSpeechWebSocket.ResponseAudioMode);
Assert.Equal("audio", elevenLabs.TextToSpeechWebSocket.ResponseAudioJsonPath);
Assert.Equal("isFinal", elevenLabs.TextToSpeechWebSocket.FinalFlagJsonPath);
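Review note on PATCH 61: after this change the turn is sent as three messages: the start template, the continue template with `{{textWithTrailingSpace}}` and `try_trigger_generation: true`, then the `{ "text": "" }` finish message. The sketch below shows how the continue message could be rendered. `TurnMessageSketch` is a hypothetical stand-in for the patch's `ApplyJsonTemplate`/`TemplateValue` machinery, and it assumes placeholder values are substituted as JSON-encoded strings.

```csharp
using System;
using System.Text.Json;

// Hypothetical sketch, not the patch's ApplyJsonTemplate: substitutes
// {{name}} placeholders with JSON-encoded string values.
public static class TurnMessageSketch
{
    // Mirrors the expression added in BuildTemplateValues.
    public static string WithTrailingSpace(string text) =>
        text.EndsWith(' ') ? text : text + " ";

    public static string RenderContinue(string template, string text) =>
        template
            .Replace("{{textWithTrailingSpace}}", JsonSerializer.Serialize(WithTrailingSpace(text)))
            .Replace("{{text}}", JsonSerializer.Serialize(text));
}
```

With the template from `voice-providers.json` and the text `Hello`, this yields `{ "text": "Hello ", "try_trigger_generation": true }`. Text that already ends in a space is passed through unchanged, so the space is never doubled.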
From e8fefaf66119ea77e71c7b580189e22bc358e650 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 26 Mar 2026 09:41:22 +0000
Subject: [PATCH 62/83] Initial plan
From 14f42794d9969499d084b655efff15802ef4786a Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 26 Mar 2026 09:41:32 +0000
Subject: [PATCH 63/83] Initial plan
From 91de10068c83932f647fc862819bb17b7235996b Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Thu, 26 Mar 2026 09:41:40 +0000
Subject: [PATCH 64/83] Update src/OpenClaw.Shared/OpenClawGatewayClient.cs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
src/OpenClaw.Shared/OpenClawGatewayClient.cs | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/OpenClaw.Shared/OpenClawGatewayClient.cs b/src/OpenClaw.Shared/OpenClawGatewayClient.cs
index 25710e7..80623c7 100644
--- a/src/OpenClaw.Shared/OpenClawGatewayClient.cs
+++ b/src/OpenClaw.Shared/OpenClawGatewayClient.cs
@@ -1311,6 +1311,7 @@ private void ClearPendingChatPreviewSessions()
lock (_pendingChatPreviewLock)
{
_pendingChatPreviewSessionKeys.Clear();
+ _lastAssistantMessagesBySession.Clear();
}
}
From 7291d7846c2b5af4927f93c96daa261138377839 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Thu, 26 Mar 2026 09:43:48 +0000
Subject: [PATCH 65/83] Update
src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
.../Services/Voice/VoiceProviderCatalogService.cs | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
index 275806e..a612576 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
@@ -78,8 +78,16 @@ public static bool SupportsTextToSpeechRuntime(string? providerId)
return true;
}
- var provider = ResolveTextToSpeechProvider(providerId);
- return provider.TextToSpeechHttp != null || provider.TextToSpeechWebSocket != null;
+ try
+ {
+ var provider = ResolveTextToSpeechProvider(providerId);
+ return provider.TextToSpeechHttp != null || provider.TextToSpeechWebSocket != null;
+ }
+ catch
+ {
+ // If the catalog or provider cannot be resolved, treat as unsupported
+ return false;
+ }
}
private static VoiceProviderCatalog NormalizeCatalog(VoiceProviderCatalog catalog)
From 0a0c3bec3e3ab7e9f51d2495c54e7dbf7461a90e Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Thu, 26 Mar 2026 09:44:19 +0000
Subject: [PATCH 66/83] Update
src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---
src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index 661013c..2521600 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -78,8 +78,9 @@ public sealed partial class WebChatWindow : WindowEx
const tag = node.parentElement.tagName;
if (tag === 'SCRIPT' || tag === 'STYLE' || tag === 'TEXTAREA') continue;
const original = node.textContent || '';
- const cleaned = original.replace(memoryPattern, '').trimStart();
- if (cleaned !== original) {
+ const withoutMemories = original.replace(memoryPattern, '');
+ if (withoutMemories !== original) {
+ const cleaned = withoutMemories.trimStart();
node.textContent = cleaned;
changed = true;
}
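Review note on PATCH 66: before this change the comparison ran after `trimStart()`, so any text node with leading whitespace counted as changed even when `memoryPattern` matched nothing. The new code compares first and trims only when something was removed. Below is a C# transliteration of that logic, offered as a sketch only. The regular expression is a hypothetical placeholder, since the real `memoryPattern` is defined outside this hunk.

```csharp
#nullable enable
using System.Text.RegularExpressions;

public static class MemoryStripSketch
{
    // Hypothetical pattern; the real memoryPattern is not visible in this hunk.
    private static readonly Regex MemoryPattern = new(@"\[memory:[^\]]*\]");

    // Returns the replacement text, or null when the node should be left untouched.
    public static string? Clean(string original)
    {
        var withoutMemories = MemoryPattern.Replace(original, string.Empty);
        if (withoutMemories == original)
        {
            return null; // nothing matched: leading whitespace is preserved
        }

        return withoutMemories.TrimStart();
    }
}
```

Under the assumed pattern, `Clean("  hello")` returns null (node untouched), while `Clean("[memory:x] hello")` returns `hello`.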
From be22a393f973bd5e1ac7b82bd4ca099cf70a46f9 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 26 Mar 2026 09:46:54 +0000
Subject: [PATCH 67/83] Propagate CancellationToken through WebSocket TTS call
chain
Co-authored-by: NichUK <346792+NichUK@users.noreply.github.com>
Agent-Logs-Url: https://github.com/NichUK/openclaw-windows-node/sessions/b0f37bbe-5816-430c-9069-9ebbdd02b0a1
---
.../Voice/VoiceCloudTextToSpeechClient.cs | 32 ++++++++++---------
.../Services/Voice/VoiceService.cs | 2 +-
2 files changed, 18 insertions(+), 16 deletions(-)
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
index ad36486..ac1df2e 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
@@ -23,7 +23,8 @@ public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
string text,
VoiceProviderOption provider,
VoiceProviderConfigurationStore configurationStore,
- IOpenClawLogger? logger = null)
+ IOpenClawLogger? logger = null,
+ CancellationToken cancellationToken = default)
{
ArgumentException.ThrowIfNullOrWhiteSpace(text);
ArgumentNullException.ThrowIfNull(provider);
@@ -31,7 +32,7 @@ public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
if (provider.TextToSpeechWebSocket != null)
{
- return await SynthesizeViaWebSocketAsync(text, provider, configurationStore, logger);
+ return await SynthesizeViaWebSocketAsync(text, provider, configurationStore, logger, cancellationToken);
}
var contract = provider.TextToSpeechHttp
@@ -83,7 +84,8 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
string text,
VoiceProviderOption provider,
VoiceProviderConfigurationStore configurationStore,
- IOpenClawLogger? logger)
+ IOpenClawLogger? logger,
+ CancellationToken cancellationToken)
{
var contract = provider.TextToSpeechWebSocket
?? throw new InvalidOperationException($"TTS provider '{provider.Name}' does not expose a WebSocket contract.");
@@ -94,21 +96,21 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
ApplyAuthenticationHeader(socket.Options, contract, templateValues);
var stopwatch = Stopwatch.StartNew();
- await socket.ConnectAsync(new Uri(endpoint), CancellationToken.None);
- var connectedMessage = await ReceiveJsonMessageAsync(socket);
+ await socket.ConnectAsync(new Uri(endpoint), cancellationToken);
+ var connectedMessage = await ReceiveJsonMessageAsync(socket, cancellationToken);
ValidateWebSocketEvent(provider.Name, contract.ConnectSuccessEventName, connectedMessage, contract);
var startMessage = ApplyJsonTemplate(contract.StartMessageTemplate, templateValues);
- await SendTextMessageAsync(socket, startMessage);
- var startedMessage = await ReceiveJsonMessageAsync(socket);
+ await SendTextMessageAsync(socket, startMessage, cancellationToken);
+ var startedMessage = await ReceiveJsonMessageAsync(socket, cancellationToken);
ValidateWebSocketEvent(provider.Name, contract.StartSuccessEventName, startedMessage, contract);
var continueMessage = ApplyJsonTemplate(contract.ContinueMessageTemplate, templateValues);
- await SendTextMessageAsync(socket, continueMessage);
+ await SendTextMessageAsync(socket, continueMessage, cancellationToken);
if (!string.IsNullOrWhiteSpace(contract.FinishMessageTemplate))
{
- await SendTextMessageAsync(socket, ApplyJsonTemplate(contract.FinishMessageTemplate, templateValues));
+ await SendTextMessageAsync(socket, ApplyJsonTemplate(contract.FinishMessageTemplate, templateValues), cancellationToken);
}
var audioBytes = new List<byte>();
@@ -116,7 +118,7 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
while (true)
{
- var message = await ReceiveJsonMessageAsync(socket);
+ var message = await ReceiveJsonMessageAsync(socket, cancellationToken);
EnsureWebSocketNotFailed(provider.Name, contract, message);
if (TryGetJsonString(message, contract.ResponseAudioJsonPath, out var audioChunk) &&
@@ -138,7 +140,7 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
try
{
- await socket.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", CancellationToken.None);
+ await socket.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", cancellationToken);
}
catch
{
@@ -470,13 +472,13 @@ private static async Task<VoiceCloudTextToSpeechResult> CreateResultAsync(Stream
return new VoiceCloudTextToSpeechResult(stream, string.IsNullOrWhiteSpace(contentType) ? "audio/mpeg" : contentType);
}
- private static async Task SendTextMessageAsync(ClientWebSocket socket, string message)
+ private static async Task SendTextMessageAsync(ClientWebSocket socket, string message, CancellationToken cancellationToken)
{
var bytes = Encoding.UTF8.GetBytes(message);
- await socket.SendAsync(bytes, WebSocketMessageType.Text, true, CancellationToken.None);
+ await socket.SendAsync(bytes, WebSocketMessageType.Text, true, cancellationToken);
}
- private static async Task<JsonElement> ReceiveJsonMessageAsync(ClientWebSocket socket)
+ private static async Task<JsonElement> ReceiveJsonMessageAsync(ClientWebSocket socket, CancellationToken cancellationToken)
{
using var buffer = new MemoryStream();
var receiveBuffer = new byte[8192];
@@ -484,7 +486,7 @@ private static async Task<JsonElement> ReceiveJsonMessageAsync(ClientWebSocket s
while (true)
{
var segment = new ArraySegment<byte>(receiveBuffer);
- var result = await socket.ReceiveAsync(segment, CancellationToken.None);
+ var result = await socket.ReceiveAsync(segment, cancellationToken);
if (result.MessageType == WebSocketMessageType.Close)
{
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 02908d2..32c5a16 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -1454,7 +1454,7 @@ private async Task SpeakTextAsync(string text, CancellationToken cancellationTok
if (UsesCloudTextToSpeechRuntime(provider))
{
- using var result = await _cloudTextToSpeechClient.SynthesizeAsync(text, provider, providerConfiguration, _logger);
+ using var result = await _cloudTextToSpeechClient.SynthesizeAsync(text, provider, providerConfiguration, _logger, cancellationToken);
await PlayStreamAsync(player, result.Stream, result.ContentType, cancellationToken);
return;
}
From 5bad5420bfcb51e2f78d0c26cd289c163021687a Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 26 Mar 2026 09:47:31 +0000
Subject: [PATCH 68/83] Thread CancellationToken through WebSocket TTS operations to prevent indefinite hangs
Co-authored-by: NichUK <346792+NichUK@users.noreply.github.com>
Agent-Logs-Url: https://github.com/NichUK/openclaw-windows-node/sessions/0cc237fa-b2b8-427a-83e8-4375e2c3f2fc
---
.../Voice/VoiceCloudTextToSpeechClient.cs | 36 +++++++++++--------
.../Services/Voice/VoiceService.cs | 2 +-
2 files changed, 22 insertions(+), 16 deletions(-)
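Note on the timeout pattern introduced in this patch: the caller's token is combined with a 30-second timeout source, and the linked token is passed to every socket call. The following is a minimal sketch of that idea under stated assumptions, not the client code itself. `LinkedTimeoutSketch`, `HangingOperationAsync`, and `RunAsync` are hypothetical names, and `Task.Delay` stands in for the WebSocket calls.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LinkedTimeoutSketch
{
    // Hypothetical stand-in for a socket call that never completes on its own.
    public static Task HangingOperationAsync(CancellationToken ct) =>
        Task.Delay(Timeout.InfiniteTimeSpan, ct);

    // Reports which side ended the wait: "caller" or "timeout".
    public static async Task<string> RunAsync(TimeSpan timeout, CancellationToken callerToken)
    {
        // One source fires after the timeout; the linked source fires when
        // either the caller token or the timeout token is cancelled.
        using var timeoutCts = new CancellationTokenSource(timeout);
        using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(
            callerToken, timeoutCts.Token);
        try
        {
            await HangingOperationAsync(linkedCts.Token);
            return "completed"; // not reached with the stand-in above
        }
        catch (OperationCanceledException)
        {
            // Checking the caller token separates user cancellation from the deadline.
            return callerToken.IsCancellationRequested ? "caller" : "timeout";
        }
    }
}
```

A caller that wants different log messages for "user stopped playback" and "provider stopped responding" could branch on the returned value in this way; the patch itself does not make that distinction.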
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
index ad36486..f509d79 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
@@ -23,7 +23,8 @@ public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
string text,
VoiceProviderOption provider,
VoiceProviderConfigurationStore configurationStore,
- IOpenClawLogger? logger = null)
+ IOpenClawLogger? logger = null,
+ CancellationToken cancellationToken = default)
{
ArgumentException.ThrowIfNullOrWhiteSpace(text);
ArgumentNullException.ThrowIfNull(provider);
@@ -31,7 +32,7 @@ public async Task<VoiceCloudTextToSpeechResult> SynthesizeAsync(
if (provider.TextToSpeechWebSocket != null)
{
- return await SynthesizeViaWebSocketAsync(text, provider, configurationStore, logger);
+ return await SynthesizeViaWebSocketAsync(text, provider, configurationStore, logger, cancellationToken);
}
var contract = provider.TextToSpeechHttp
@@ -83,7 +84,8 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
string text,
VoiceProviderOption provider,
VoiceProviderConfigurationStore configurationStore,
- IOpenClawLogger? logger)
+ IOpenClawLogger? logger,
+ CancellationToken cancellationToken)
{
var contract = provider.TextToSpeechWebSocket
?? throw new InvalidOperationException($"TTS provider '{provider.Name}' does not expose a WebSocket contract.");
@@ -93,22 +95,26 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
using var socket = new ClientWebSocket();
ApplyAuthenticationHeader(socket.Options, contract, templateValues);
+ using var timeoutCts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
+ using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken, timeoutCts.Token);
+ var ct = linkedCts.Token;
+
var stopwatch = Stopwatch.StartNew();
- await socket.ConnectAsync(new Uri(endpoint), CancellationToken.None);
- var connectedMessage = await ReceiveJsonMessageAsync(socket);
+ await socket.ConnectAsync(new Uri(endpoint), ct);
+ var connectedMessage = await ReceiveJsonMessageAsync(socket, ct);
ValidateWebSocketEvent(provider.Name, contract.ConnectSuccessEventName, connectedMessage, contract);
var startMessage = ApplyJsonTemplate(contract.StartMessageTemplate, templateValues);
- await SendTextMessageAsync(socket, startMessage);
- var startedMessage = await ReceiveJsonMessageAsync(socket);
+ await SendTextMessageAsync(socket, startMessage, ct);
+ var startedMessage = await ReceiveJsonMessageAsync(socket, ct);
ValidateWebSocketEvent(provider.Name, contract.StartSuccessEventName, startedMessage, contract);
var continueMessage = ApplyJsonTemplate(contract.ContinueMessageTemplate, templateValues);
- await SendTextMessageAsync(socket, continueMessage);
+ await SendTextMessageAsync(socket, continueMessage, ct);
if (!string.IsNullOrWhiteSpace(contract.FinishMessageTemplate))
{
- await SendTextMessageAsync(socket, ApplyJsonTemplate(contract.FinishMessageTemplate, templateValues));
+ await SendTextMessageAsync(socket, ApplyJsonTemplate(contract.FinishMessageTemplate, templateValues), ct);
}
var audioBytes = new List<byte>();
@@ -116,7 +122,7 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
while (true)
{
- var message = await ReceiveJsonMessageAsync(socket);
+ var message = await ReceiveJsonMessageAsync(socket, ct);
EnsureWebSocketNotFailed(provider.Name, contract, message);
if (TryGetJsonString(message, contract.ResponseAudioJsonPath, out var audioChunk) &&
@@ -138,7 +144,7 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
try
{
- await socket.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", CancellationToken.None);
+ await socket.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", ct);
}
catch
{
@@ -470,13 +476,13 @@ private static async Task<VoiceCloudTextToSpeechResult> CreateResultAsync(Stream
return new VoiceCloudTextToSpeechResult(stream, string.IsNullOrWhiteSpace(contentType) ? "audio/mpeg" : contentType);
}
- private static async Task SendTextMessageAsync(ClientWebSocket socket, string message)
+ private static async Task SendTextMessageAsync(ClientWebSocket socket, string message, CancellationToken cancellationToken)
{
var bytes = Encoding.UTF8.GetBytes(message);
- await socket.SendAsync(bytes, WebSocketMessageType.Text, true, CancellationToken.None);
+ await socket.SendAsync(bytes, WebSocketMessageType.Text, true, cancellationToken);
}
- private static async Task<JsonElement> ReceiveJsonMessageAsync(ClientWebSocket socket)
+ private static async Task<JsonElement> ReceiveJsonMessageAsync(ClientWebSocket socket, CancellationToken cancellationToken)
{
using var buffer = new MemoryStream();
var receiveBuffer = new byte[8192];
@@ -484,7 +490,7 @@ private static async Task<JsonElement> ReceiveJsonMessageAsync(ClientWebSocket s
while (true)
{
var segment = new ArraySegment<byte>(receiveBuffer);
- var result = await socket.ReceiveAsync(segment, CancellationToken.None);
+ var result = await socket.ReceiveAsync(segment, cancellationToken);
if (result.MessageType == WebSocketMessageType.Close)
{
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 02908d2..32c5a16 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -1454,7 +1454,7 @@ private async Task SpeakTextAsync(string text, CancellationToken cancellationTok
if (UsesCloudTextToSpeechRuntime(provider))
{
- using var result = await _cloudTextToSpeechClient.SynthesizeAsync(text, provider, providerConfiguration, _logger);
+ using var result = await _cloudTextToSpeechClient.SynthesizeAsync(text, provider, providerConfiguration, _logger, cancellationToken);
await PlayStreamAsync(player, result.Stream, result.ContentType, cancellationToken);
return;
}
From 2a93e5a74be6d2454aec0f2a7bb1b1f6ca5988a4 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 26 Mar 2026 10:05:07 +0000
Subject: [PATCH 69/83] Simplify to single CancellationTokenSource with linked token and timeout
Co-authored-by: NichUK <346792+NichUK@users.noreply.github.com>
Agent-Logs-Url: https://github.com/NichUK/openclaw-windows-node/sessions/556f63ce-3524-4508-9ac9-5b05a7697956
---
.../Services/Voice/VoiceCloudTextToSpeechClient.cs | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
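Note on the simplification in this patch: a single linked source with `CancelAfter` covers both the caller's token and the deadline, so there is only one object to dispose. The sketch below illustrates the behaviour under stated assumptions. `SingleSourceTimeoutSketch` and `WaitIsCancelledAsync` are hypothetical names, and `Task.Delay` again stands in for the socket calls.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SingleSourceTimeoutSketch
{
    // Returns true when the wait was cancelled, either by the caller or by the deadline.
    public static async Task<bool> WaitIsCancelledAsync(TimeSpan timeout, CancellationToken callerToken)
    {
        // The linked source observes the caller token; CancelAfter adds the deadline
        // to the same source, so no second CancellationTokenSource is needed.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        cts.CancelAfter(timeout);
        try
        {
            await Task.Delay(Timeout.InfiniteTimeSpan, cts.Token);
            return false;
        }
        catch (OperationCanceledException)
        {
            return true;
        }
    }
}
```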
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
index f509d79..10ec354 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceCloudTextToSpeechClient.cs
@@ -95,9 +95,9 @@ private static async Task<VoiceCloudTextToSpeechResult> SynthesizeViaWebSocketAs
using var socket = new ClientWebSocket();
ApplyAuthenticationHeader(socket.Options, contract, templateValues);
- using var timeoutCts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
- using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken, timeoutCts.Token);
- var ct = linkedCts.Token;
+ using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
+ cts.CancelAfter(TimeSpan.FromSeconds(30));
+ var ct = cts.Token;
var stopwatch = Stopwatch.StartNew();
await socket.ConnectAsync(new Uri(endpoint), ct);
From c8f5ba1c2f6379941d677cd427cf8823eadbdffd Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 26 Mar 2026 10:12:26 +0000
Subject: [PATCH 70/83] Add VoiceCloudTextToSpeechClient cancellation and decode tests
Co-authored-by: NichUK <346792+NichUK@users.noreply.github.com>
Agent-Logs-Url: https://github.com/NichUK/openclaw-windows-node/sessions/368e6f83-a2f3-412c-bac7-47d57ddd4d92
---
.../VoiceCloudTextToSpeechClientTests.cs | 75 +++++++++++++++++++
1 file changed, 75 insertions(+)
create mode 100644 tests/OpenClaw.Tray.Tests/VoiceCloudTextToSpeechClientTests.cs
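Note on the decode tests in this patch: they call a private `DecodeAudioBytes` method with the modes `hexJsonString` and `base64JsonString` and expect `InvalidOperationException` for anything else. The method body is not part of this patch, so the following is only a sketch of a decoder that would satisfy those tests. `AudioDecodeSketch.Decode` is a hypothetical name, the error message is invented for illustration, and `Convert.FromHexString` assumes .NET 5 or later.

```csharp
using System;

public static class AudioDecodeSketch
{
    // Decodes an audio chunk carried as a JSON string, based on the declared mode.
    public static byte[] Decode(string mode, string value, string providerName) => mode switch
    {
        // Hex digits, two per byte; upper and lower case are both accepted.
        "hexJsonString" => Convert.FromHexString(value),
        // Standard base64 with padding.
        "base64JsonString" => Convert.FromBase64String(value),
        // Any other mode is rejected rather than silently passed through.
        _ => throw new InvalidOperationException(
            $"TTS provider '{providerName}' uses unsupported audio mode '{mode}'.")
    };
}
```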
diff --git a/tests/OpenClaw.Tray.Tests/VoiceCloudTextToSpeechClientTests.cs b/tests/OpenClaw.Tray.Tests/VoiceCloudTextToSpeechClientTests.cs
new file mode 100644
index 0000000..75cefc0
--- /dev/null
+++ b/tests/OpenClaw.Tray.Tests/VoiceCloudTextToSpeechClientTests.cs
@@ -0,0 +1,75 @@
+using System;
+using System.Reflection;
+using System.Threading;
+using System.Threading.Tasks;
+using OpenClaw.Shared;
+using OpenClawTray.Services.Voice;
+
+namespace OpenClaw.Tray.Tests;
+
+public class VoiceCloudTextToSpeechClientTests
+{
+ [Fact]
+ public async Task SynthesizeAsync_ThrowsOperationCanceled_WhenCallerTokenIsPreCancelled()
+ {
+ var client = new VoiceCloudTextToSpeechClient();
+ var provider = new VoiceProviderOption
+ {
+ Id = "test-ws",
+ Name = "Test WS",
+ Settings =
+ [
+ new VoiceProviderSettingDefinition { Key = "apiKey", Secret = true }
+ ],
+ TextToSpeechWebSocket = new VoiceTextToSpeechWebSocketContract
+ {
+ EndpointTemplate = "wss://127.0.0.1:0/tts"
+ }
+ };
+ var store = new VoiceProviderConfigurationStore();
+ store.SetValue("test-ws", "apiKey", "test-key");
+
+ using var cts = new CancellationTokenSource();
+ cts.Cancel();
+
+ await Assert.ThrowsAnyAsync<OperationCanceledException>(
+ () => client.SynthesizeAsync("hello", provider, store, cancellationToken: cts.Token));
+ }
+
+ [Fact]
+ public void DecodeAudioBytes_DecodesHexString()
+ {
+ var result = InvokeDecodeAudioBytes("hexJsonString", "48656c6c6f", "TestProvider");
+
+ Assert.Equal([72, 101, 108, 108, 111], result); // "Hello"
+ }
+
+ [Fact]
+ public void DecodeAudioBytes_DecodesBase64String()
+ {
+ var result = InvokeDecodeAudioBytes("base64JsonString", "SGVsbG8=", "TestProvider");
+
+ Assert.Equal([72, 101, 108, 108, 111], result); // "Hello"
+ }
+
+ [Fact]
+ public void DecodeAudioBytes_ThrowsForUnsupportedMode()
+ {
+ var method = GetDecodeAudioBytesMethod();
+
+ var ex = Assert.Throws<TargetInvocationException>(
+ () => method.Invoke(null, ["unsupported", "data", "TestProvider"]));
+
+ Assert.IsType<InvalidOperationException>(ex.InnerException);
+ }
+
+ private static byte[] InvokeDecodeAudioBytes(string mode, string value, string providerName)
+ {
+ return (byte[])GetDecodeAudioBytesMethod().Invoke(null, [mode, value, providerName])!;
+ }
+
+ private static MethodInfo GetDecodeAudioBytesMethod() =>
+ typeof(VoiceCloudTextToSpeechClient).GetMethod(
+ "DecodeAudioBytes",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+}
From 8b81870e9e272631e20bda706eba5b61b18a865a Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Thu, 26 Mar 2026 10:37:42 +0000
Subject: [PATCH 71/83] Log talk mode recognizer restart decisions
---
docs/VOICE-MODE.md | 1 +
.../Services/Voice/VoiceService.cs | 93 ++++++++++++++++++-
.../VoiceServiceTransportTests.cs | 55 +++++++++++
3 files changed, 148 insertions(+), 1 deletion(-)
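Note on the decision logging in this patch: the new `Describe...` helpers return a reason string alongside the existing boolean predicates. One possible way to keep the two from drifting apart, shown here purely as an alternative sketch and not as what the patch does, is to derive the boolean from the reason. `SketchActivationMode` is a local stand-in for `VoiceActivationMode`, and `RestartDecisionSketch` is a hypothetical name.

```csharp
using System;

// Local stand-in for the project's VoiceActivationMode enum.
public enum SketchActivationMode { TalkMode, VoiceWake }

public static class RestartDecisionSketch
{
    // Mirrors the order of checks in the patch: the first blocking condition wins.
    public static string Describe(
        bool running,
        SketchActivationMode mode,
        bool restartInProgress,
        bool awaitingReply,
        bool isSpeaking)
    {
        if (!running) return "runtime-not-running";
        if (mode != SketchActivationMode.TalkMode) return $"mode={mode}";
        if (restartInProgress) return "controlled-restart-in-progress";
        if (awaitingReply) return "awaiting-reply";
        if (isSpeaking) return "speaking";
        return "eligible";
    }

    // Assumption: a restart is allowed exactly when no blocking reason applies.
    public static bool ShouldRestart(
        bool running,
        SketchActivationMode mode,
        bool restartInProgress,
        bool awaitingReply,
        bool isSpeaking) =>
        Describe(running, mode, restartInProgress, awaitingReply, isSpeaking) == "eligible";
}
```

The trade-off is a string comparison on each completion event, which is negligible at this call rate; an enum of reasons would avoid even that.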
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index c087286..5a46afb 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -1075,3 +1075,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Adjusted the ElevenLabs `stream-input` contract to send `xi_api_key` in the init message and `flush: true` with the text turn so short replies are emitted promptly instead of being buffered then closed.
- `2026-03-25` Stopped sending an immediate ElevenLabs EOS message before reading audio, so single-turn `stream-input` replies rely on `flush: true` instead of prematurely closing the socket.
- `2026-03-25` Switched ElevenLabs `stream-input` to the published text-generation pattern: send the text turn with a trailing-space variant plus `try_trigger_generation: true`, then send EOS, and include the WebSocket close status in playback errors.
+- `2026-03-26` Merged the latest `feature/voice-mode` updates and added explicit recognizer completion decision logging so Talk Mode now records why a stopped Windows speech session did or did not restart/rebuild after idle or watchdog completions.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 32c5a16..b985474 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -1517,6 +1517,41 @@ internal static bool ShouldRestartRecognitionAfterCompletion(
!isSpeaking;
}
+ internal static string DescribeRecognitionCompletionRestartDecision(
+ bool running,
+ VoiceActivationMode mode,
+ bool restartInProgress,
+ bool awaitingReply,
+ bool isSpeaking)
+ {
+ if (!running)
+ {
+ return "runtime-not-running";
+ }
+
+ if (mode != VoiceActivationMode.TalkMode)
+ {
+ return $"mode={mode}";
+ }
+
+ if (restartInProgress)
+ {
+ return "controlled-restart-in-progress";
+ }
+
+ if (awaitingReply)
+ {
+ return "awaiting-reply";
+ }
+
+ if (isSpeaking)
+ {
+ return "speaking";
+ }
+
+ return "eligible";
+ }
+
internal static bool ShouldRebuildRecognitionAfterCompletion(
SpeechRecognitionResultStatus status,
bool sessionHadActivity,
@@ -1535,6 +1570,47 @@ internal static bool ShouldRebuildRecognitionAfterCompletion(
status == SpeechRecognitionResultStatus.TimeoutExceeded;
}
+ internal static string DescribeRecognitionCompletionRebuildDecision(
+ SpeechRecognitionResultStatus status,
+ bool sessionHadActivity,
+ bool sessionHadCaptureSignal,
+ bool restartInProgress,
+ bool awaitingReply,
+ bool isSpeaking)
+ {
+ if (restartInProgress)
+ {
+ return "controlled-restart-in-progress";
+ }
+
+ if (awaitingReply)
+ {
+ return "awaiting-reply";
+ }
+
+ if (isSpeaking)
+ {
+ return "speaking";
+ }
+
+ if (sessionHadActivity)
+ {
+ return "session-had-activity";
+ }
+
+ if (sessionHadCaptureSignal)
+ {
+ return "capture-signal-without-recognition";
+ }
+
+ return status switch
+ {
+ SpeechRecognitionResultStatus.UserCanceled => "user-canceled-without-activity",
+ SpeechRecognitionResultStatus.TimeoutExceeded => "timeout-without-activity",
+ _ => $"status={status}"
+ };
+ }
+
internal static string SelectRecognizedText(
string recognizedText,
string? latestHypothesisText,
@@ -1675,6 +1751,8 @@ private async void OnSpeechRecognitionCompleted(
var restartInProgress = false;
var sessionHadActivity = false;
var sessionHadCaptureSignal = false;
+ var restartDecisionReason = string.Empty;
+ var rebuildDecisionReason = string.Empty;
string? fallbackText = null;
lock (_gate)
@@ -1713,6 +1791,12 @@ private async void OnSpeechRecognitionCompleted(
restartInProgress,
_awaitingReply,
_isSpeaking);
+ restartDecisionReason = DescribeRecognitionCompletionRestartDecision(
+ _status.Running,
+ _status.Mode,
+ restartInProgress,
+ _awaitingReply,
+ _isSpeaking);
shouldRebuildRecognizer = ShouldRebuildRecognitionAfterCompletion(
args.Status,
sessionHadActivity,
@@ -1720,10 +1804,17 @@ private async void OnSpeechRecognitionCompleted(
restartInProgress,
_awaitingReply,
_isSpeaking);
+ rebuildDecisionReason = DescribeRecognitionCompletionRebuildDecision(
+ args.Status,
+ sessionHadActivity,
+ sessionHadCaptureSignal,
+ restartInProgress,
+ _awaitingReply,
+ _isSpeaking);
}
_logger.Warn(
- $"Speech recognition session completed with status {args.Status}; restart={shouldRestart}; rebuild={shouldRebuildRecognizer}; hadActivity={sessionHadActivity}; hadCaptureSignal={sessionHadCaptureSignal}");
+ $"Speech recognition session completed with status {args.Status}; restart={shouldRestart} ({restartDecisionReason}); rebuild={shouldRebuildRecognizer} ({rebuildDecisionReason}); hadActivity={sessionHadActivity}; hadCaptureSignal={sessionHadCaptureSignal}");
if (!string.IsNullOrWhiteSpace(fallbackText) &&
!_awaitingReply &&
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index b1bc39b..6429217 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -151,6 +151,32 @@ public void ShouldRestartRecognitionAfterCompletion_SuppressesControlledRecycle(
Assert.Equal(expected, result);
}
+ [Theory]
+ [InlineData(true, VoiceActivationMode.TalkMode, false, false, false, "eligible")]
+ [InlineData(true, VoiceActivationMode.VoiceWake, false, false, false, "mode=VoiceWake")]
+ [InlineData(false, VoiceActivationMode.TalkMode, false, false, false, "runtime-not-running")]
+ [InlineData(true, VoiceActivationMode.TalkMode, true, false, false, "controlled-restart-in-progress")]
+ [InlineData(true, VoiceActivationMode.TalkMode, false, true, false, "awaiting-reply")]
+ [InlineData(true, VoiceActivationMode.TalkMode, false, false, true, "speaking")]
+ public void DescribeRecognitionCompletionRestartDecision_ExplainsWhyRestartIsBlocked(
+ bool running,
+ VoiceActivationMode mode,
+ bool restartInProgress,
+ bool awaitingReply,
+ bool isSpeaking,
+ string expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "DescribeRecognitionCompletionRestartDecision",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (string)method.Invoke(
+ null,
+ [running, mode, restartInProgress, awaitingReply, isSpeaking])!;
+
+ Assert.Equal(expected, result);
+ }
+
[Theory]
[InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, false, false, true)]
[InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, false, true)]
@@ -180,6 +206,35 @@ public void ShouldRebuildRecognitionAfterCompletion_OnlyRebuildsForDeafCanceledS
Assert.Equal(expected, result);
}
+ [Theory]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, true, false, false, false, "capture-signal-without-recognition")]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, false, false, "user-canceled-without-activity")]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, false, "timeout-without-activity")]
+ [InlineData(SpeechRecognitionResultStatus.Success, false, false, false, false, false, "status=Success")]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, true, true, false, false, false, "session-had-activity")]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, true, true, false, false, "controlled-restart-in-progress")]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, true, false, true, false, "awaiting-reply")]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, true, false, false, true, "speaking")]
+ public void DescribeRecognitionCompletionRebuildDecision_ExplainsWhyRebuildIsBlocked(
+ SpeechRecognitionResultStatus status,
+ bool sessionHadActivity,
+ bool sessionHadCaptureSignal,
+ bool restartInProgress,
+ bool awaitingReply,
+ bool isSpeaking,
+ string expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "DescribeRecognitionCompletionRebuildDecision",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (string)method.Invoke(
+ null,
+ [status, sessionHadActivity, sessionHadCaptureSignal, restartInProgress, awaitingReply, isSpeaking])!;
+
+ Assert.Equal(expected, result);
+ }
+
[Theory]
[InlineData(0.029f, false)]
[InlineData(0.03f, true)]
From f1db7b7c17c0ecf2aebf2b1c24f08de6bab45ca9 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Thu, 26 Mar 2026 10:49:59 +0000
Subject: [PATCH 72/83] Fix stale talk mode restart latch
---
docs/VOICE-MODE.md | 1 +
src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs | 2 ++
2 files changed, 3 insertions(+)
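Note on the latch fix in this patch: the restart flag is now cleared both when a new recognition session starts and at the top of the resume path, before any early exit. A minimal sketch of that lifecycle follows, under stated assumptions. `RestartLatchSketch` and its members are hypothetical names, and the `canResume` parameter stands in for the runtime checks the real resume method performs.

```csharp
using System;

public sealed class RestartLatchSketch
{
    private readonly object _gate = new();
    private bool _restartInProgress;

    public bool IsRestartInProgress
    {
        get { lock (_gate) { return _restartInProgress; } }
    }

    // Set before a controlled recycle so the completion handler does not restart twice.
    public void BeginControlledRestart()
    {
        lock (_gate) { _restartInProgress = true; }
    }

    // Cleared as soon as a new session is running.
    public void OnSessionStarted()
    {
        lock (_gate) { _restartInProgress = false; }
    }

    // Cleared before the eligibility check, so a refused resume cannot strand the latch.
    public bool TryResume(bool canResume)
    {
        lock (_gate)
        {
            _restartInProgress = false;
            return canResume;
        }
    }
}
```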
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 5a46afb..198023c 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -1076,3 +1076,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Stopped sending an immediate ElevenLabs EOS message before reading audio, so single-turn `stream-input` replies rely on `flush: true` instead of prematurely closing the socket.
- `2026-03-25` Switched ElevenLabs `stream-input` to the published text-generation pattern: send the text turn with a trailing-space variant plus `try_trigger_generation: true`, then send EOS, and include the WebSocket close status in playback errors.
- `2026-03-26` Merged the latest `feature/voice-mode` updates and added explicit recognizer completion decision logging so Talk Mode now records why a stopped Windows speech session did or did not restart/rebuild after idle or watchdog completions.
+- `2026-03-26` Cleared the controlled-restart latch as soon as a new recognition session successfully starts, and on failed resume attempts, so Talk Mode does not get stranded in `controlled-restart-in-progress` after an idle/deaf recognizer recycle.
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index b985474..d0e939a 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -758,6 +758,7 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
lock (_gate)
{
_recognitionActive = true;
+ _recognitionRestartInProgress = false;
_recognitionSessionHadActivity = false;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
@@ -875,6 +876,7 @@ private async Task ResumeRecognitionSessionAsync(
lock (_gate)
{
+ _recognitionRestartInProgress = false;
if (_runtimeCts == null ||
!_status.Running ||
_status.Mode != VoiceActivationMode.TalkMode ||
From 1515e438ab41d6f350c72132b68700cfe5bf5cfb Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Thu, 26 Mar 2026 16:57:52 +0000
Subject: [PATCH 73/83] Split Talk Mode STT into route classes
---
docs/VOICE-MODE.md | 1 +
src/OpenClaw.Shared/VoiceModeSchema.cs | 14 +++-
.../Assets/voice-providers.json | 52 ++++++++++++++-
.../Controls/VoiceSettingsPanel.xaml.cs | 8 ++-
.../AudioGraphStreamingSpeechToTextRoute.cs | 29 +++++++++
.../Services/Voice/IVoiceSpeechToTextRoute.cs | 15 +++++
.../Voice/SherpaOnnxSpeechToTextRoute.cs | 29 +++++++++
.../Voice/VoiceProviderCatalogService.cs | 19 ++++++
.../Services/Voice/VoiceService.cs | 65 ++++++++-----------
.../Voice/VoiceSpeechToTextRouteFactory.cs | 45 +++++++++++++
.../Voice/VoiceSpeechToTextRouteKind.cs | 8 +++
.../Voice/VoiceSpeechToTextRouteResources.cs | 9 +++
.../Voice/WindowsMediaSpeechToTextRoute.cs | 51 +++++++++++++++
.../VoiceModeSchemaTests.cs | 4 ++
.../VoiceProviderCatalogServiceTests.cs | 22 ++++++-
15 files changed, 330 insertions(+), 41 deletions(-)
create mode 100644 src/OpenClaw.Tray.WinUI/Services/Voice/AudioGraphStreamingSpeechToTextRoute.cs
create mode 100644 src/OpenClaw.Tray.WinUI/Services/Voice/IVoiceSpeechToTextRoute.cs
create mode 100644 src/OpenClaw.Tray.WinUI/Services/Voice/SherpaOnnxSpeechToTextRoute.cs
create mode 100644 src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteFactory.cs
create mode 100644 src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteKind.cs
create mode 100644 src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteResources.cs
create mode 100644 src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs
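Note on route selection in this patch: each STT provider declares a runtime (`windows`, `streaming`, or `embedded`), and startup is split into one route class per runtime. The sketch below shows one way such a mapping could look. It is an illustration under stated assumptions, not the contents of `VoiceSpeechToTextRouteFactory.cs`. `SketchRouteKind` and `RouteSelectionSketch` are hypothetical names; only the runtime strings come from `VoiceProviderRuntimeIds`.

```csharp
using System;

// Local stand-in for the project's route kind enum.
public enum SketchRouteKind { WindowsMedia, Streaming, Embedded }

public static class RouteSelectionSketch
{
    // Maps a provider runtime identifier to the route that should start STT.
    public static SketchRouteKind SelectRoute(string runtime) => runtime switch
    {
        "windows" => SketchRouteKind.WindowsMedia,
        "streaming" => SketchRouteKind.Streaming,
        "embedded" => SketchRouteKind.Embedded,
        // Assumption: runtimes without an STT route (for example "cloud") are rejected.
        _ => throw new NotSupportedException(
            $"No STT route is defined for runtime '{runtime}'.")
    };
}
```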
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 198023c..1d449b8 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -1077,3 +1077,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-25` Switched ElevenLabs `stream-input` to the published text-generation pattern: send the text turn with a trailing-space variant plus `try_trigger_generation: true`, then send EOS, and include the WebSocket close status in playback errors.
- `2026-03-26` Merged the latest `feature/voice-mode` updates and added explicit recognizer completion decision logging so Talk Mode now records why a stopped Windows speech session did or did not restart/rebuild after idle or watchdog completions.
- `2026-03-26` Cleared the controlled-restart latch as soon as a new recognition session successfully starts, and on failed resume attempts, so Talk Mode does not get stranded in `controlled-restart-in-progress` after an idle/deaf recognizer recycle.
+- `2026-03-26` Split Talk Mode STT startup into explicit route classes: the built-in `windows` provider now uses a pure `Windows.Media` path with no AudioGraph, while `foundry-local` and `sherpa-onnx` are separated into dedicated future AudioGraph/embedded route classes instead of sharing the Windows pipeline.
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index 6c52886..62ab03b 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -154,14 +154,26 @@ public sealed class VoiceSettingsUpdateArgs
public static class VoiceProviderIds
{
public const string Windows = "windows";
+ public const string FoundryLocal = "foundry-local";
+ public const string SherpaOnnx = "sherpa-onnx";
public const string MiniMax = "minimax";
public const string ElevenLabs = "elevenlabs";
}
+public static class VoiceProviderRuntimeIds
+{
+ public const string Windows = "windows";
+ public const string Streaming = "streaming";
+ public const string Embedded = "embedded";
+ public const string Cloud = "cloud";
+}
+
public static class VoiceProviderSettingKeys
{
public const string ApiKey = "apiKey";
+ public const string Endpoint = "endpoint";
public const string Model = "model";
+ public const string ModelPath = "modelPath";
public const string VoiceId = "voiceId";
public const string VoiceSettingsJson = "voiceSettingsJson";
}
@@ -249,7 +261,7 @@ public sealed class VoiceProviderOption
{
public string Id { get; set; } = "";
public string Name { get; set; } = "";
- public string Runtime { get; set; } = "windows";
+ public string Runtime { get; set; } = VoiceProviderRuntimeIds.Windows;
public bool Enabled { get; set; } = true;
public string? Description { get; set; }
public List<VoiceProviderSettingDefinition> Settings { get; set; } = [];
diff --git a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
index 996bfce..5bd32cd 100644
--- a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
+++ b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
@@ -5,7 +5,57 @@
"name": "Windows Speech Recognition",
"runtime": "windows",
"enabled": true,
- "description": "Built-in Windows dictation and speech recognition."
+ "description": "Built-in Windows.Media speech recognition without AudioGraph."
+ },
+ {
+ "id": "foundry-local",
+ "name": "Foundry Local",
+ "runtime": "streaming",
+ "enabled": true,
+ "description": "AudioGraph-fed streaming STT route for Foundry Local or compatible streaming adapters.",
+ "settings": [
+ {
+ "key": "endpoint",
+ "label": "Endpoint",
+ "required": false,
+ "defaultValue": "http://localhost:5273",
+ "placeholder": "http://localhost:5273",
+ "description": "Local Foundry-compatible transcription endpoint for the AudioGraph streaming STT route."
+ },
+ {
+ "key": "model",
+ "label": "Model",
+ "required": false,
+ "defaultValue": "whisper-tiny",
+ "placeholder": "whisper-tiny",
+ "description": "Transcription model identifier for the streaming STT adapter."
+ }
+ ]
+ },
+ {
+ "id": "sherpa-onnx",
+ "name": "sherpa-onnx",
+ "runtime": "embedded",
+ "enabled": true,
+ "description": "Local embedded STT route for user-supplied sherpa-onnx model bundles.",
+ "settings": [
+ {
+ "key": "modelPath",
+ "label": "Model path",
+ "required": false,
+ "defaultValue": "",
+ "placeholder": "C:\\models\\sherpa-onnx\\model.onnx",
+ "description": "Path to the downloaded sherpa-onnx model bundle the embedded STT route should use."
+ },
+ {
+ "key": "model",
+ "label": "Model preset",
+ "required": false,
+ "defaultValue": "",
+ "placeholder": "tiny / base / small / medium",
+ "description": "Optional human-readable model preset to help track which local bundle is selected."
+ }
+ ]
}
],
"textToSpeechProviders": [
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
index ce985be..2345543 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -204,10 +204,16 @@ private void UpdateVoiceSettingsInfo()
var output = (VoiceOutputDeviceComboBox.SelectedItem as DeviceOption)?.Name ?? "System default speaker";
var fallbackNotice = string.Empty;
+ if (VoiceSpeechToTextProviderComboBox.SelectedItem is VoiceProviderOption sttOption &&
+ !VoiceProviderCatalogService.SupportsSpeechToTextRuntime(sttOption.Id))
+ {
+ fallbackNotice += " Selected non-Windows STT routes are scaffolded but not implemented yet.";
+ }
+
if (VoiceTextToSpeechProviderComboBox.SelectedItem is VoiceProviderOption ttsOption &&
!VoiceProviderCatalogService.SupportsTextToSpeechRuntime(ttsOption.Id))
{
- fallbackNotice = " Unsupported TTS providers will fall back to Windows until their runtime adapters are added.";
+ fallbackNotice += " Unsupported TTS providers will fall back to Windows until their runtime adapters are added.";
}
VoiceSettingsInfoTextBlock.Text =
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/AudioGraphStreamingSpeechToTextRoute.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/AudioGraphStreamingSpeechToTextRoute.cs
new file mode 100644
index 0000000..088200a
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/AudioGraphStreamingSpeechToTextRoute.cs
@@ -0,0 +1,29 @@
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+using OpenClaw.Shared;
+
+namespace OpenClawTray.Services.Voice;
+
+internal sealed class AudioGraphStreamingSpeechToTextRoute : IVoiceSpeechToTextRoute
+{
+ private readonly IOpenClawLogger _logger;
+
+ public AudioGraphStreamingSpeechToTextRoute(IOpenClawLogger logger)
+ {
+ _logger = logger;
+ }
+
+ public VoiceSpeechToTextRouteKind Kind => VoiceSpeechToTextRouteKind.Streaming;
+
+ public Task<VoiceSpeechToTextRouteResources> StartAsync(
+ VoiceProviderOption provider,
+ VoiceSettings settings,
+ CancellationToken cancellationToken)
+ {
+ cancellationToken.ThrowIfCancellationRequested();
+ _logger.Info($"Selected streaming STT route for provider '{provider.Name}'.");
+ throw new NotSupportedException(
+ $"STT provider '{provider.Name}' is assigned to the AudioGraph streaming route, but that adapter is not implemented yet.");
+ }
+}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/IVoiceSpeechToTextRoute.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/IVoiceSpeechToTextRoute.cs
new file mode 100644
index 0000000..16e3350
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/IVoiceSpeechToTextRoute.cs
@@ -0,0 +1,15 @@
+using System.Threading;
+using System.Threading.Tasks;
+using OpenClaw.Shared;
+
+namespace OpenClawTray.Services.Voice;
+
+internal interface IVoiceSpeechToTextRoute
+{
+ VoiceSpeechToTextRouteKind Kind { get; }
+
+ Task<VoiceSpeechToTextRouteResources> StartAsync(
+ VoiceProviderOption provider,
+ VoiceSettings settings,
+ CancellationToken cancellationToken);
+}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/SherpaOnnxSpeechToTextRoute.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/SherpaOnnxSpeechToTextRoute.cs
new file mode 100644
index 0000000..3698d51
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/SherpaOnnxSpeechToTextRoute.cs
@@ -0,0 +1,29 @@
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+using OpenClaw.Shared;
+
+namespace OpenClawTray.Services.Voice;
+
+internal sealed class SherpaOnnxSpeechToTextRoute : IVoiceSpeechToTextRoute
+{
+ private readonly IOpenClawLogger _logger;
+
+ public SherpaOnnxSpeechToTextRoute(IOpenClawLogger logger)
+ {
+ _logger = logger;
+ }
+
+ public VoiceSpeechToTextRouteKind Kind => VoiceSpeechToTextRouteKind.SherpaOnnx;
+
+ public Task<VoiceSpeechToTextRouteResources> StartAsync(
+ VoiceProviderOption provider,
+ VoiceSettings settings,
+ CancellationToken cancellationToken)
+ {
+ cancellationToken.ThrowIfCancellationRequested();
+ _logger.Info($"Selected embedded sherpa-onnx STT route for provider '{provider.Name}'.");
+ throw new NotSupportedException(
+ "The sherpa-onnx STT route is not implemented yet. This route will require a user-provided local model bundle.");
+ }
+}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
index a612576..9516a20 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
@@ -71,6 +71,25 @@ public static bool SupportsWindowsRuntime(string? providerId)
return string.Equals(providerId, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase);
}
+ public static bool SupportsSpeechToTextRuntime(string? providerId)
+ {
+ try
+ {
+ var provider = ResolveSpeechToTextProvider(providerId);
+ return VoiceSpeechToTextRouteFactory.ResolveRouteKind(provider) == VoiceSpeechToTextRouteKind.WindowsMedia;
+ }
+ catch
+ {
+ return false;
+ }
+ }
+
+ internal static VoiceSpeechToTextRouteKind ResolveSpeechToTextRouteKind(string? providerId, IOpenClawLogger? logger = null)
+ {
+ var provider = ResolveSpeechToTextProvider(providerId, logger);
+ return VoiceSpeechToTextRouteFactory.ResolveRouteKind(provider);
+ }
+
public static bool SupportsTextToSpeechRuntime(string? providerId)
{
if (SupportsWindowsRuntime(providerId))
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index d0e939a..58916cb 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -51,6 +51,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private ConnectionStatus _chatTransportStatus = ConnectionStatus.Disconnected;
private TaskCompletionSource<bool>? _transportReadyTcs;
private VoiceCaptureService? _voiceCaptureService;
+ private IVoiceSpeechToTextRoute? _speechToTextRoute;
private SpeechRecognizer? _speechRecognizer;
private SpeechSynthesizer? _speechSynthesizer;
private MediaPlayer? _mediaPlayer;
@@ -514,6 +515,7 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
await EnsureMicrophoneConsentAsync();
CancellationTokenSource? runtimeCts = null;
+ IVoiceSpeechToTextRoute? speechToTextRoute = null;
VoiceCaptureService? captureService = null;
SpeechRecognizer? recognizer = null;
SpeechSynthesizer? synthesizer = null;
@@ -522,28 +524,31 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
try
{
runtimeCts = new CancellationTokenSource();
- captureService = new VoiceCaptureService(_logger);
- captureService.SignalDetected += OnCaptureSignalDetected;
- await captureService.StartAsync(settings, runtimeCts.Token);
- recognizer = await CreateSpeechRecognizerAsync(settings);
+ speechToTextRoute = VoiceSpeechToTextRouteFactory.Create(selectedSpeechToText, _logger);
+ var speechToTextResources = await speechToTextRoute.StartAsync(selectedSpeechToText, settings, runtimeCts.Token);
+ captureService = speechToTextResources.CaptureService;
+ recognizer = speechToTextResources.SpeechRecognizer;
synthesizer = new SpeechSynthesizer();
player = new MediaPlayer();
await ConfigurePlaybackOutputDeviceAsync(player, settings);
- if (!string.IsNullOrWhiteSpace(settings.InputDeviceId))
+ if (captureService != null)
{
- _logger.Warn(
- "AudioGraph capture is bound to the selected input device, but Windows STT transcription still follows the system speech input path until the STT adapter migration is complete.");
+ captureService.SignalDetected += OnCaptureSignalDetected;
}
- recognizer.HypothesisGenerated += OnSpeechHypothesisGenerated;
- recognizer.ContinuousRecognitionSession.ResultGenerated += OnSpeechResultGenerated;
- recognizer.ContinuousRecognitionSession.Completed += OnSpeechRecognitionCompleted;
+ if (recognizer != null)
+ {
+ recognizer.HypothesisGenerated += OnSpeechHypothesisGenerated;
+ recognizer.ContinuousRecognitionSession.ResultGenerated += OnSpeechResultGenerated;
+ recognizer.ContinuousRecognitionSession.Completed += OnSpeechRecognitionCompleted;
+ }
lock (_gate)
{
_runtimeCts = runtimeCts;
_voiceCaptureService = captureService;
+ _speechToTextRoute = speechToTextRoute;
_speechRecognizer = recognizer;
_speechSynthesizer = synthesizer;
_mediaPlayer = player;
@@ -596,26 +601,6 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
}
}
- private async Task<SpeechRecognizer> CreateSpeechRecognizerAsync(VoiceSettings settings)
- {
- var recognizer = new SpeechRecognizer();
- recognizer.Timeouts.EndSilenceTimeout = TimeSpan.FromMilliseconds(settings.TalkMode.EndSilenceMs);
- recognizer.Timeouts.InitialSilenceTimeout = TimeSpan.FromSeconds(10);
- recognizer.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(4);
- recognizer.Constraints.Add(new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.Dictation, "always-on-dictation"));
-
- var compilation = await recognizer.CompileConstraintsAsync();
- if (compilation.Status != SpeechRecognitionResultStatus.Success)
- {
- recognizer.Dispose();
- throw new InvalidOperationException($"Speech recognizer unavailable: {compilation.Status}");
- }
-
- _logger.Info($"Speech recognizer compiled successfully ({compilation.Status})");
-
- return recognizer;
- }
-
private async Task ConfigurePlaybackOutputDeviceAsync(MediaPlayer player, VoiceSettings settings)
{
if (string.IsNullOrWhiteSpace(settings.OutputDeviceId))
@@ -791,12 +776,10 @@ private async Task MonitorListeningReadyAsync(int generation, CancellationToken
captureService = _voiceCaptureService;
}
- if (captureService == null)
+ if (captureService != null)
{
- return;
+ await captureService.WaitForCaptureReadyAsync(cancellationToken);
}
-
- await captureService.WaitForCaptureReadyAsync(cancellationToken);
await Task.Delay(InitialRecognitionReadyDelay, cancellationToken);
var transitionedToListening = false;
@@ -1908,6 +1891,7 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
captureService = _voiceCaptureService;
_voiceCaptureService = null;
+ _speechToTextRoute = null;
recognizer = _speechRecognizer;
_speechRecognizer = null;
@@ -2224,6 +2208,7 @@ private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken
{
SpeechRecognizer? oldRecognizer;
SpeechRecognizer? newRecognizer = null;
+ IVoiceSpeechToTextRoute? speechToTextRoute;
VoiceSettings settings;
lock (_gate)
@@ -2234,6 +2219,7 @@ private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken
}
oldRecognizer = _speechRecognizer;
+ speechToTextRoute = _speechToTextRoute;
settings = Clone(_settings.Voice);
_speechRecognizer = null;
_recognitionActive = false;
@@ -2256,7 +2242,12 @@ private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken
try
{
cancellationToken.ThrowIfCancellationRequested();
- newRecognizer = await CreateSpeechRecognizerAsync(settings);
+ if (speechToTextRoute is not WindowsMediaSpeechToTextRoute windowsRoute)
+ {
+ throw new InvalidOperationException("Speech recognizer rebuild is only available for the Windows.Media STT route.");
+ }
+
+ newRecognizer = await windowsRoute.CreateRecognizerAsync(settings);
newRecognizer.HypothesisGenerated += OnSpeechHypothesisGenerated;
newRecognizer.ContinuousRecognitionSession.ResultGenerated += OnSpeechResultGenerated;
newRecognizer.ContinuousRecognitionSession.Completed += OnSpeechRecognitionCompleted;
@@ -2529,9 +2520,9 @@ private static VoiceStatusInfo Clone(VoiceStatusInfo source)
{
var fallbacks = new List<string>();
- if (!VoiceProviderCatalogService.SupportsWindowsRuntime(speechToTextProvider.Id))
+ if (!VoiceProviderCatalogService.SupportsSpeechToTextRuntime(speechToTextProvider.Id))
{
- fallbacks.Add($"STT '{speechToTextProvider.Name}' is not implemented yet; using Windows Speech Recognition.");
+ fallbacks.Add($"STT '{speechToTextProvider.Name}' is not implemented yet.");
}
if (!VoiceProviderCatalogService.SupportsTextToSpeechRuntime(textToSpeechProvider.Id))
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteFactory.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteFactory.cs
new file mode 100644
index 0000000..7a211fa
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteFactory.cs
@@ -0,0 +1,45 @@
+using System;
+using OpenClaw.Shared;
+
+namespace OpenClawTray.Services.Voice;
+
+internal static class VoiceSpeechToTextRouteFactory
+{
+ public static IVoiceSpeechToTextRoute Create(
+ VoiceProviderOption provider,
+ IOpenClawLogger logger)
+ {
+ ArgumentNullException.ThrowIfNull(provider);
+ ArgumentNullException.ThrowIfNull(logger);
+
+ return ResolveRouteKind(provider) switch
+ {
+ VoiceSpeechToTextRouteKind.WindowsMedia => new WindowsMediaSpeechToTextRoute(logger),
+ VoiceSpeechToTextRouteKind.Streaming => new AudioGraphStreamingSpeechToTextRoute(logger),
+ VoiceSpeechToTextRouteKind.SherpaOnnx => new SherpaOnnxSpeechToTextRoute(logger),
+ _ => new WindowsMediaSpeechToTextRoute(logger)
+ };
+ }
+
+ public static VoiceSpeechToTextRouteKind ResolveRouteKind(VoiceProviderOption provider)
+ {
+ ArgumentNullException.ThrowIfNull(provider);
+
+ if (string.Equals(provider.Id, VoiceProviderIds.SherpaOnnx, StringComparison.OrdinalIgnoreCase))
+ {
+ return VoiceSpeechToTextRouteKind.SherpaOnnx;
+ }
+
+ if (string.Equals(provider.Runtime, VoiceProviderRuntimeIds.Streaming, StringComparison.OrdinalIgnoreCase))
+ {
+ return VoiceSpeechToTextRouteKind.Streaming;
+ }
+
+ if (string.Equals(provider.Runtime, VoiceProviderRuntimeIds.Embedded, StringComparison.OrdinalIgnoreCase))
+ {
+ return VoiceSpeechToTextRouteKind.SherpaOnnx;
+ }
+
+ return VoiceSpeechToTextRouteKind.WindowsMedia;
+ }
+}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteKind.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteKind.cs
new file mode 100644
index 0000000..86871d8
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteKind.cs
@@ -0,0 +1,8 @@
+namespace OpenClawTray.Services.Voice;
+
+internal enum VoiceSpeechToTextRouteKind
+{
+ WindowsMedia,
+ Streaming,
+ SherpaOnnx
+}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteResources.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteResources.cs
new file mode 100644
index 0000000..c3b5f54
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceSpeechToTextRouteResources.cs
@@ -0,0 +1,9 @@
+using Windows.Media.SpeechRecognition;
+
+namespace OpenClawTray.Services.Voice;
+
+internal sealed class VoiceSpeechToTextRouteResources
+{
+ public VoiceCaptureService? CaptureService { get; init; }
+ public SpeechRecognizer? SpeechRecognizer { get; init; }
+}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs
new file mode 100644
index 0000000..caa3ad1
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs
@@ -0,0 +1,51 @@
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+using OpenClaw.Shared;
+using Windows.Media.SpeechRecognition;
+
+namespace OpenClawTray.Services.Voice;
+
+internal sealed class WindowsMediaSpeechToTextRoute : IVoiceSpeechToTextRoute
+{
+ private readonly IOpenClawLogger _logger;
+
+ public WindowsMediaSpeechToTextRoute(IOpenClawLogger logger)
+ {
+ _logger = logger;
+ }
+
+ public VoiceSpeechToTextRouteKind Kind => VoiceSpeechToTextRouteKind.WindowsMedia;
+
+ public async Task<VoiceSpeechToTextRouteResources> StartAsync(
+ VoiceProviderOption provider,
+ VoiceSettings settings,
+ CancellationToken cancellationToken)
+ {
+ cancellationToken.ThrowIfCancellationRequested();
+
+ return new VoiceSpeechToTextRouteResources
+ {
+ SpeechRecognizer = await CreateRecognizerAsync(settings)
+ };
+ }
+
+ public async Task<SpeechRecognizer> CreateRecognizerAsync(VoiceSettings settings)
+ {
+ var recognizer = new SpeechRecognizer();
+ recognizer.Timeouts.EndSilenceTimeout = TimeSpan.FromMilliseconds(settings.TalkMode.EndSilenceMs);
+ recognizer.Timeouts.InitialSilenceTimeout = TimeSpan.FromSeconds(10);
+ recognizer.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(4);
+ recognizer.Constraints.Add(new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.Dictation, "always-on-dictation"));
+
+ var compilation = await recognizer.CompileConstraintsAsync();
+ if (compilation.Status != SpeechRecognitionResultStatus.Success)
+ {
+ recognizer.Dispose();
+ throw new InvalidOperationException($"Speech recognizer unavailable: {compilation.Status}");
+ }
+
+ _logger.Info($"Speech recognizer compiled successfully ({compilation.Status})");
+ return recognizer;
+ }
+}
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index 70047cf..559c96f 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -86,8 +86,12 @@ public void VoiceProviderCatalog_Defaults_ToEmptyLists()
public void VoiceProviderIds_ExposeRequiredBuiltInProviders()
{
Assert.Equal("windows", VoiceProviderIds.Windows);
+ Assert.Equal("foundry-local", VoiceProviderIds.FoundryLocal);
+ Assert.Equal("sherpa-onnx", VoiceProviderIds.SherpaOnnx);
Assert.Equal("minimax", VoiceProviderIds.MiniMax);
Assert.Equal("elevenlabs", VoiceProviderIds.ElevenLabs);
+ Assert.Equal("endpoint", VoiceProviderSettingKeys.Endpoint);
+ Assert.Equal("modelPath", VoiceProviderSettingKeys.ModelPath);
Assert.Equal("voiceSettingsJson", VoiceProviderSettingKeys.VoiceSettingsJson);
}
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index 34d5245..d0001dd 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -35,15 +35,26 @@ public void CatalogFilePath_ResolvesToExistingBundledAsset()
}
[Fact]
- public void LoadCatalog_IncludesBuiltInMiniMaxAndElevenLabsTtsProviders()
+ public void LoadCatalog_IncludesBuiltInSpeechAndTtsProviders()
{
var catalog = VoiceProviderCatalogService.LoadCatalog();
+ Assert.Contains(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.Windows);
+ Assert.Contains(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.FoundryLocal);
+ Assert.Contains(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.SherpaOnnx);
Assert.Contains(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.Windows);
Assert.Contains(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.MiniMax);
Assert.Contains(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.ElevenLabs);
}
+ [Fact]
+ public void SupportsSpeechToTextRuntime_ReturnsTrueOnlyForWindowsMediaRoute()
+ {
+ Assert.True(VoiceProviderCatalogService.SupportsSpeechToTextRuntime(VoiceProviderIds.Windows));
+ Assert.False(VoiceProviderCatalogService.SupportsSpeechToTextRuntime(VoiceProviderIds.FoundryLocal));
+ Assert.False(VoiceProviderCatalogService.SupportsSpeechToTextRuntime(VoiceProviderIds.SherpaOnnx));
+ }
+
[Fact]
public void SupportsTextToSpeechRuntime_ReturnsTrueForMiniMaxOnlyWhenImplemented()
{
@@ -57,6 +68,15 @@ public void LoadCatalog_ExposesBuiltInCloudTtsContracts()
{
var catalog = VoiceProviderCatalogService.LoadCatalog();
+ var foundryLocal = Assert.Single(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.FoundryLocal);
+ Assert.Equal(VoiceProviderRuntimeIds.Streaming, foundryLocal.Runtime);
+ Assert.Equal("http://localhost:5273", foundryLocal.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Endpoint).DefaultValue);
+ Assert.Equal("whisper-tiny", foundryLocal.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Model).DefaultValue);
+
+ var sherpaOnnx = Assert.Single(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.SherpaOnnx);
+ Assert.Equal(VoiceProviderRuntimeIds.Embedded, sherpaOnnx.Runtime);
+ Assert.Equal(string.Empty, sherpaOnnx.Settings.Single(s => s.Key == VoiceProviderSettingKeys.ModelPath).DefaultValue);
+
var minimax = Assert.Single(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.MiniMax);
Assert.Equal("MiniMax", minimax.Name);
Assert.NotNull(minimax.TextToSpeechWebSocket);
From 7d074d7d457d8e2810304a8915fb407decff1d89 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Thu, 26 Mar 2026 17:37:37 +0000
Subject: [PATCH 74/83] Stop rebuilding Windows STT on idle timeouts
---
src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs | 5 ++---
tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs | 4 ++--
2 files changed, 4 insertions(+), 5 deletions(-)
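
Note (commentary only, not part of the commit): the behavioural change in this patch can be read as the small sketch below. It is a simplification under stated assumptions: `RecognitionStatus` stands in for `Windows.Media.SpeechRecognition.SpeechRecognitionResultStatus`, `RebuildSketch.ShouldRebuild` is a hypothetical name, and the `sessionHadActivity` guard is inferred from the test data in VoiceServiceTransportTests.cs rather than shown in the hunk. The real method takes further guard flags that are not visible here.

```csharp
using System;

// Stand-in for Windows.Media.SpeechRecognition.SpeechRecognitionResultStatus (assumption).
enum RecognitionStatus { Success, UserCanceled, TimeoutExceeded }

static class RebuildSketch
{
    // Hypothetical reduction of ShouldRebuildRecognitionAfterCompletion to two flags.
    public static bool ShouldRebuild(
        RecognitionStatus status,
        bool sessionHadActivity,
        bool sessionHadCaptureSignal)
    {
        // Inferred guard: a session that produced recognition activity is healthy.
        if (sessionHadActivity)
        {
            return false;
        }

        // After this patch an idle TimeoutExceeded no longer forces a rebuild on its own;
        // only a capture signal without recognition, or UserCanceled, does.
        return sessionHadCaptureSignal ||
               status == RecognitionStatus.UserCanceled;
    }
}
```

Under these assumptions, an idle timeout with no capture signal keeps the existing recognizer, while a timeout that did see microphone signal still triggers a rebuild.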
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 58916cb..4b4db06 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -1551,8 +1551,7 @@ internal static bool ShouldRebuildRecognitionAfterCompletion(
}
return sessionHadCaptureSignal ||
- status == SpeechRecognitionResultStatus.UserCanceled ||
- status == SpeechRecognitionResultStatus.TimeoutExceeded;
+ status == SpeechRecognitionResultStatus.UserCanceled;
}
internal static string DescribeRecognitionCompletionRebuildDecision(
@@ -1591,7 +1590,7 @@ internal static string DescribeRecognitionCompletionRebuildDecision(
return status switch
{
SpeechRecognitionResultStatus.UserCanceled => "user-canceled-without-activity",
- SpeechRecognitionResultStatus.TimeoutExceeded => "timeout-without-activity",
+ SpeechRecognitionResultStatus.TimeoutExceeded => "timeout-without-capture-signal",
_ => $"status={status}"
};
}
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index 6429217..0bf1451 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -179,7 +179,7 @@ public void DescribeRecognitionCompletionRestartDecision_ExplainsWhyRestartIsBlo
[Theory]
[InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, false, false, true)]
- [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, false, true)]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, false, false)]
[InlineData(SpeechRecognitionResultStatus.Success, false, false, false, false, false, false)]
[InlineData(SpeechRecognitionResultStatus.Success, false, true, false, false, false, true)]
[InlineData(SpeechRecognitionResultStatus.UserCanceled, true, false, false, false, false, false)]
@@ -209,7 +209,7 @@ public void ShouldRebuildRecognitionAfterCompletion_OnlyRebuildsForDeafCanceledS
[Theory]
[InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, true, false, false, false, "capture-signal-without-recognition")]
[InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, false, false, "user-canceled-without-activity")]
- [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, false, "timeout-without-activity")]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, false, "timeout-without-capture-signal")]
[InlineData(SpeechRecognitionResultStatus.Success, false, false, false, false, false, "status=Success")]
[InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, true, true, false, false, false, "session-had-activity")]
[InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, true, true, false, false, "controlled-restart-in-progress")]
From 03e7e39643f658a4c17809e6944e3b5d9dfdc5a1 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Thu, 26 Mar 2026 17:50:31 +0000
Subject: [PATCH 75/83] Catalog future STT providers without exposing them
---
docs/VOICE-MODE.md | 52 ++++++++++--
src/OpenClaw.Shared/VoiceModeSchema.cs | 11 +++
.../Assets/voice-providers.json | 80 ++++++++++++++++++-
.../Controls/VoiceSettingsPanel.xaml | 10 ++-
.../Controls/VoiceSettingsPanel.xaml.cs | 50 ++++++++++++
.../Voice/VoiceProviderCatalogService.cs | 4 +-
.../VoiceModeSchemaTests.cs | 14 ++++
.../VoiceProviderCatalogServiceTests.cs | 18 +++--
8 files changed, 222 insertions(+), 17 deletions(-)
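
Note (commentary only, not part of the commit): a rough sketch of how the two new catalog flags might combine when building a provider list. `ProviderSketch` is a reduced stand-in for `VoiceProviderOption`, and `CatalogSketch.VisibleLabels` is a hypothetical helper; whether the settings panel filters on `VisibleInSettings` in exactly this way is an assumption. The `DisplayName` and `DisplayOpacity` expressions mirror the ones added to VoiceModeSchema.cs below.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reduced stand-in for OpenClaw.Shared.VoiceProviderOption (only the fields used here).
sealed record ProviderSketch(
    string Id,
    string Name,
    bool VisibleInSettings = true,
    bool Selectable = true)
{
    // Same expressions as the computed properties introduced by this patch.
    public string DisplayName => Selectable ? Name : $"{Name} (coming soon)";
    public double DisplayOpacity => Selectable ? 1.0 : 0.55;
}

static class CatalogSketch
{
    // Hypothetical helper: labels for the providers a settings list would show.
    public static List<string> VisibleLabels(IEnumerable<ProviderSketch> providers) =>
        providers
            .Where(p => p.VisibleInSettings)   // hidden entries stay catalog-only
            .Select(p => p.DisplayName)        // non-selectable entries get a suffix
            .ToList();
}
```

With a catalog holding a default provider, a hidden `foundry-local` entry, and a visible but non-selectable `sherpa-onnx` entry, this sketch would list the default provider unchanged and `sherpa-onnx (coming soon)`, and omit `foundry-local`.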
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 1d449b8..82c48ce 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -876,13 +876,55 @@ Notes:
- supported keys should at least include voice, model, and the documented voice-shaping parameters
- provider-specific validation should happen through the provider contract layer where possible
-### Story: Support non-local (or non-Windows, local) STT providers
+### Story: Foundry Local STT provider
-Allow the user to select a non-local STT provider like OpenAI Whisper, or a local non-Windows recognizer, instead of being locked to the Windows built-in path.
+Implement the AudioGraph-fed streaming STT adapter for Foundry Local.
-- Windows built-in local STT is working pretty well, however users should have the choice to utilise:
- - a non-local STT provider
- - a local non-Windows STT provider
+Notes:
+
+- provider metadata now lives in the provider catalog, but the provider should stay disabled in settings until the runtime adapter exists
+- this route should use the shared streaming STT path rather than the Windows.Media recognizer path
+- endpoint and model selection should come from the provider catalog settings contract
+
+### Story: OpenAI Whisper STT provider
+
+Implement the AudioGraph-fed streaming STT adapter for OpenAI Whisper transcription.
+
+Notes:
+
+- this should be catalog-driven and disabled in settings until the adapter is production-ready
+- the initial implementation only needs the basic transcription path, not translation or diarization
+- API key and model configuration should come from the provider catalog
+
+### Story: ElevenLabs Speech to Text provider
+
+Implement the AudioGraph-fed streaming STT adapter for ElevenLabs speech-to-text.
+
+Notes:
+
+- keep it catalog-driven and disabled in settings until the runtime path is implemented
+- match the same route abstraction used by the other non-Windows STT providers
+- any provider-specific partial/final transcript semantics should be normalized in the adapter layer
+
+### Story: Azure AI Speech STT provider
+
+Implement the AudioGraph-fed streaming STT adapter for Azure AI Speech.
+
+Notes:
+
+- use the official Azure AI Speech naming in settings and docs rather than an internal "Foundry Azure STT" label
+- keep the provider catalog entry disabled until the adapter is functional end to end
+- endpoint and credential handling should come from the provider settings contract
+
+### Story: sherpa-onnx embedded STT provider
+
+Implement the local embedded sherpa-onnx STT route for user-supplied model bundles.
+
+Notes:
+
+- keep this visible but greyed out in settings until the embedded runtime is implemented
+- the user should be able to choose their own downloaded model bundle and language-appropriate package
+- model lifecycle, validation, and error reporting should be handled in the embedded adapter rather than in the Windows.Media route
### Story: Full-duplex / barge-in Talk Mode
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index 62ab03b..7281ff4 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -155,6 +155,9 @@ public static class VoiceProviderIds
{
public const string Windows = "windows";
public const string FoundryLocal = "foundry-local";
+ public const string OpenAiWhisper = "openai-whisper";
+ public const string ElevenLabsSpeechToText = "elevenlabs-stt";
+ public const string AzureAiSpeech = "azure-ai-speech";
public const string SherpaOnnx = "sherpa-onnx";
public const string MiniMax = "minimax";
public const string ElevenLabs = "elevenlabs";
@@ -263,10 +266,18 @@ public sealed class VoiceProviderOption
public string Name { get; set; } = "";
public string Runtime { get; set; } = VoiceProviderRuntimeIds.Windows;
public bool Enabled { get; set; } = true;
+ public bool VisibleInSettings { get; set; } = true;
+ public bool Selectable { get; set; } = true;
public string? Description { get; set; }
public List<VoiceProviderSettingDefinition> Settings { get; set; } = [];
public VoiceTextToSpeechHttpContract? TextToSpeechHttp { get; set; }
public VoiceTextToSpeechWebSocketContract? TextToSpeechWebSocket { get; set; }
+
+ [JsonIgnore]
+ public string DisplayName => Selectable ? Name : $"{Name} (coming soon)";
+
+ [JsonIgnore]
+ public double DisplayOpacity => Selectable ? 1.0 : 0.55;
}
public sealed class VoiceProviderCatalog
diff --git a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
index 5bd32cd..589fd06 100644
--- a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
+++ b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
@@ -11,7 +11,9 @@
"id": "foundry-local",
"name": "Foundry Local",
"runtime": "streaming",
- "enabled": true,
+ "enabled": false,
+ "visibleInSettings": false,
+ "selectable": false,
"description": "AudioGraph-fed streaming STT route for Foundry Local or compatible streaming adapters.",
"settings": [
{
@@ -32,11 +34,85 @@
}
]
},
+ {
+ "id": "openai-whisper",
+ "name": "OpenAI Whisper",
+ "runtime": "streaming",
+ "enabled": false,
+ "visibleInSettings": false,
+ "selectable": false,
+ "description": "AudioGraph-fed cloud STT route for the OpenAI Whisper transcription API.",
+ "settings": [
+ {
+ "key": "apiKey",
+ "label": "API key",
+ "secret": true
+ },
+ {
+ "key": "model",
+ "label": "Model",
+ "required": false,
+ "defaultValue": "whisper-1",
+ "placeholder": "whisper-1",
+ "description": "Transcription model identifier for the OpenAI speech-to-text adapter."
+ }
+ ]
+ },
+ {
+ "id": "elevenlabs-stt",
+ "name": "ElevenLabs Speech to Text",
+ "runtime": "streaming",
+ "enabled": false,
+ "visibleInSettings": false,
+ "selectable": false,
+ "description": "AudioGraph-fed cloud STT route for the ElevenLabs speech-to-text API.",
+ "settings": [
+ {
+ "key": "apiKey",
+ "label": "API key",
+ "secret": true
+ },
+ {
+ "key": "model",
+ "label": "Model",
+ "required": false,
+ "defaultValue": "scribe_v1",
+ "placeholder": "scribe_v1",
+ "description": "Transcription model identifier for the ElevenLabs speech-to-text adapter."
+ }
+ ]
+ },
+ {
+ "id": "azure-ai-speech",
+ "name": "Azure AI Speech",
+ "runtime": "streaming",
+ "enabled": false,
+ "visibleInSettings": false,
+ "selectable": false,
+ "description": "AudioGraph-fed cloud STT route for Azure AI Speech real-time transcription.",
+ "settings": [
+ {
+ "key": "apiKey",
+ "label": "API key",
+ "secret": true
+ },
+ {
+ "key": "endpoint",
+ "label": "Endpoint",
+ "required": false,
+ "defaultValue": "",
+ "placeholder": "https://your-speech-resource.cognitiveservices.azure.com",
+ "description": "Azure AI Speech endpoint for the streaming STT adapter."
+ }
+ ]
+ },
{
"id": "sherpa-onnx",
"name": "sherpa-onnx",
"runtime": "embedded",
- "enabled": true,
+ "enabled": false,
+ "visibleInSettings": true,
+ "selectable": false,
"description": "Local embedded STT route for user-supplied sherpa-onnx model bundles.",
"settings": [
{
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
index 7353299..5136e21 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
@@ -4,6 +4,12 @@
xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">
+ <UserControl.Resources>
+ <DataTemplate x:Key="VoiceProviderOptionTemplate">
+ <TextBlock Text="{Binding DisplayName}" Opacity="{Binding DisplayOpacity}"/>
+ </DataTemplate>
+ </UserControl.Resources>
+
<StackPanel Spacing="8">
<TextBlock Text="VOICE" Style="{StaticResource CaptionTextBlockStyle}"
Foreground="#E74C3C" FontWeight="Bold"/>
@@ -16,11 +22,11 @@
<ComboBox x:Name="VoiceSpeechToTextProviderComboBox"
Header="Speech to text provider"
- DisplayMemberPath="Name"
+ ItemTemplate="{StaticResource VoiceProviderOptionTemplate}"
SelectionChanged="OnVoiceProviderChanged"/>
<ComboBox x:Name="VoiceTextToSpeechProviderComboBox"
Header="Text to speech provider"
- DisplayMemberPath="Name"
+ ItemTemplate="{StaticResource VoiceProviderOptionTemplate}"
SelectionChanged="OnVoiceProviderChanged"/>
<StackPanel x:Name="VoiceTtsProviderSettingsPanel" Spacing="8" Visibility="Collapsed">
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
index 2345543..ade8bf2 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -15,6 +15,7 @@ public sealed partial class VoiceSettingsPanel : UserControl
private SettingsManager? _settings;
private IVoiceConfigurationApi? _voiceConfigurationApi;
private VoiceProviderConfigurationStore _voiceProviderConfigurationDraft = new();
+ private string _activeSttProviderId = VoiceProviderIds.Windows;
private string _activeTtsProviderId = VoiceProviderIds.Windows;
private bool _updatingVoiceProviderFields;
private List<VoiceProviderOption> _speechToTextOptions = new();
@@ -120,6 +121,9 @@ private void LoadVoiceProviders()
VoiceTextToSpeechProviderComboBox.SelectedItem =
_textToSpeechOptions.FirstOrDefault(p => p.Id == _settings!.Voice.TextToSpeechProviderId)
?? _textToSpeechOptions.FirstOrDefault();
+
+ _ = EnsureSelectableProviderSelection(VoiceSpeechToTextProviderComboBox, _speechToTextOptions, ref _activeSttProviderId);
+ _ = EnsureSelectableProviderSelection(VoiceTextToSpeechProviderComboBox, _textToSpeechOptions, ref _activeTtsProviderId);
}
private async Task LoadVoiceDevicesAsync()
@@ -345,6 +349,18 @@ private void OnVoiceModeChanged(object sender, SelectionChangedEventArgs e)
private void OnVoiceProviderChanged(object sender, SelectionChangedEventArgs e)
{
+ if (ReferenceEquals(sender, VoiceSpeechToTextProviderComboBox) &&
+ !EnsureSelectableProviderSelection(VoiceSpeechToTextProviderComboBox, _speechToTextOptions, ref _activeSttProviderId))
+ {
+ return;
+ }
+
+ if (ReferenceEquals(sender, VoiceTextToSpeechProviderComboBox) &&
+ !EnsureSelectableProviderSelection(VoiceTextToSpeechProviderComboBox, _textToSpeechOptions, ref _activeTtsProviderId))
+ {
+ return;
+ }
+
CaptureSelectedVoiceProviderSettings();
UpdateVoiceProviderSettingsEditor();
UpdateVoiceSettingsInfo();
@@ -407,6 +423,8 @@ private static VoiceProviderOption Clone(VoiceProviderOption source)
Name = source.Name,
Runtime = source.Runtime,
Enabled = source.Enabled,
+ VisibleInSettings = source.VisibleInSettings,
+ Selectable = source.Selectable,
Description = source.Description,
Settings = source.Settings
.Select(setting => new VoiceProviderSettingDefinition
@@ -464,4 +482,36 @@ private static VoiceProviderOption Clone(VoiceProviderOption source)
}
};
}
+
+ private static bool EnsureSelectableProviderSelection(
+ ComboBox comboBox,
+ IReadOnlyList<VoiceProviderOption> options,
+ ref string activeProviderId)
+ {
+ var previousProviderId = activeProviderId;
+
+ if (comboBox.SelectedItem is VoiceProviderOption selected && selected.Selectable)
+ {
+ activeProviderId = selected.Id;
+ return true;
+ }
+
+ var fallback = options.FirstOrDefault(option =>
+ option.Selectable &&
+ string.Equals(option.Id, previousProviderId, StringComparison.OrdinalIgnoreCase))
+ ?? options.FirstOrDefault(option => option.Selectable);
+
+ if (fallback == null)
+ {
+ return false;
+ }
+
+ if (!ReferenceEquals(comboBox.SelectedItem, fallback))
+ {
+ comboBox.SelectedItem = fallback;
+ }
+
+ activeProviderId = fallback.Id;
+ return false;
+ }
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
index 9516a20..41e26dd 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
@@ -123,7 +123,7 @@ private static List<VoiceProviderOption> NormalizeProviders(List<VoiceProviderOp
return providers
.Where(p => !string.IsNullOrWhiteSpace(p.Id))
.Select(Clone)
- .Where(p => p.Enabled)
+ .Where(p => p.Enabled || p.VisibleInSettings)
.OrderByDescending(p => string.Equals(p.Id, VoiceProviderIds.Windows, StringComparison.OrdinalIgnoreCase))
.ThenBy(p => p.Name, StringComparer.OrdinalIgnoreCase)
.ToList();
@@ -159,6 +159,8 @@ private static VoiceProviderOption Clone(VoiceProviderOption source)
Name = source.Name,
Runtime = source.Runtime,
Enabled = source.Enabled,
+ VisibleInSettings = source.VisibleInSettings,
+ Selectable = source.Selectable,
Description = source.Description,
Settings = source.Settings.Select(Clone).ToList(),
TextToSpeechHttp = Clone(source.TextToSpeechHttp),
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index 559c96f..37093be 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -87,6 +87,9 @@ public void VoiceProviderIds_ExposeRequiredBuiltInProviders()
{
Assert.Equal("windows", VoiceProviderIds.Windows);
Assert.Equal("foundry-local", VoiceProviderIds.FoundryLocal);
+ Assert.Equal("openai-whisper", VoiceProviderIds.OpenAiWhisper);
+ Assert.Equal("elevenlabs-stt", VoiceProviderIds.ElevenLabsSpeechToText);
+ Assert.Equal("azure-ai-speech", VoiceProviderIds.AzureAiSpeech);
Assert.Equal("sherpa-onnx", VoiceProviderIds.SherpaOnnx);
Assert.Equal("minimax", VoiceProviderIds.MiniMax);
Assert.Equal("elevenlabs", VoiceProviderIds.ElevenLabs);
@@ -95,6 +98,17 @@ public void VoiceProviderIds_ExposeRequiredBuiltInProviders()
Assert.Equal("voiceSettingsJson", VoiceProviderSettingKeys.VoiceSettingsJson);
}
+ [Fact]
+ public void VoiceProviderOption_Defaults_ToVisibleAndSelectable()
+ {
+ var option = new VoiceProviderOption { Name = "Provider" };
+
+ Assert.True(option.VisibleInSettings);
+ Assert.True(option.Selectable);
+ Assert.Equal("Provider", option.DisplayName);
+ Assert.Equal(1.0, option.DisplayOpacity);
+ }
+
[Fact]
public void VoiceProviderConfigurationStore_Defaults_ToEmptyProviders()
{
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index d0001dd..cac28f6 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -35,13 +35,16 @@ public void CatalogFilePath_ResolvesToExistingBundledAsset()
}
[Fact]
- public void LoadCatalog_IncludesBuiltInSpeechAndTtsProviders()
+ public void LoadCatalog_IncludesOnlySelectableAndVisibleSpeechProviders()
{
var catalog = VoiceProviderCatalogService.LoadCatalog();
Assert.Contains(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.Windows);
- Assert.Contains(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.FoundryLocal);
Assert.Contains(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.SherpaOnnx);
+ Assert.DoesNotContain(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.FoundryLocal);
+ Assert.DoesNotContain(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.OpenAiWhisper);
+ Assert.DoesNotContain(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.ElevenLabsSpeechToText);
+ Assert.DoesNotContain(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.AzureAiSpeech);
Assert.Contains(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.Windows);
Assert.Contains(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.MiniMax);
Assert.Contains(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.ElevenLabs);
@@ -52,6 +55,9 @@ public void SupportsSpeechToTextRuntime_ReturnsTrueOnlyForWindowsMediaRoute()
{
Assert.True(VoiceProviderCatalogService.SupportsSpeechToTextRuntime(VoiceProviderIds.Windows));
Assert.False(VoiceProviderCatalogService.SupportsSpeechToTextRuntime(VoiceProviderIds.FoundryLocal));
+ Assert.False(VoiceProviderCatalogService.SupportsSpeechToTextRuntime(VoiceProviderIds.OpenAiWhisper));
+ Assert.False(VoiceProviderCatalogService.SupportsSpeechToTextRuntime(VoiceProviderIds.ElevenLabsSpeechToText));
+ Assert.False(VoiceProviderCatalogService.SupportsSpeechToTextRuntime(VoiceProviderIds.AzureAiSpeech));
Assert.False(VoiceProviderCatalogService.SupportsSpeechToTextRuntime(VoiceProviderIds.SherpaOnnx));
}
@@ -68,13 +74,11 @@ public void LoadCatalog_ExposesBuiltInCloudTtsContracts()
{
var catalog = VoiceProviderCatalogService.LoadCatalog();
- var foundryLocal = Assert.Single(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.FoundryLocal);
- Assert.Equal(VoiceProviderRuntimeIds.Streaming, foundryLocal.Runtime);
- Assert.Equal("http://localhost:5273", foundryLocal.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Endpoint).DefaultValue);
- Assert.Equal("whisper-tiny", foundryLocal.Settings.Single(s => s.Key == VoiceProviderSettingKeys.Model).DefaultValue);
-
var sherpaOnnx = Assert.Single(catalog.SpeechToTextProviders, p => p.Id == VoiceProviderIds.SherpaOnnx);
Assert.Equal(VoiceProviderRuntimeIds.Embedded, sherpaOnnx.Runtime);
+ Assert.False(sherpaOnnx.Enabled);
+ Assert.True(sherpaOnnx.VisibleInSettings);
+ Assert.False(sherpaOnnx.Selectable);
Assert.Equal(string.Empty, sherpaOnnx.Settings.Single(s => s.Key == VoiceProviderSettingKeys.ModelPath).DefaultValue);
var minimax = Assert.Single(catalog.TextToSpeechProviders, p => p.Id == VoiceProviderIds.MiniMax);
From ad471b23bc78737b08b92aa70a5508f78cf0c9e6 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Thu, 26 Mar 2026 22:30:03 +0000
Subject: [PATCH 76/83] Polish talk mode recovery and settings
---
docs/VOICE-MODE.md | 45 ++-
src/OpenClaw.Shared/VoiceModeSchema.cs | 1 +
.../Assets/voice-providers.json | 13 +-
.../Controls/VoiceSettingsPanel.xaml | 62 +++-
.../Controls/VoiceSettingsPanel.xaml.cs | 50 +++
.../Services/Voice/VoiceChatContracts.cs | 1 +
.../Services/Voice/VoiceChatCoordinator.cs | 41 +++
.../Services/Voice/VoiceService.cs | 330 +++++-------------
.../Voice/WindowsMediaSpeechToTextRoute.cs | 10 +-
.../Windows/WebChatWindow.xaml.cs | 93 +++++
.../VoiceChatCoordinatorTests.cs | 55 +++
.../VoiceServiceTransportTests.cs | 110 +++---
12 files changed, 489 insertions(+), 322 deletions(-)
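
Reviewer note (not part of the commit): the `VoiceChatCoordinator` change in this patch buffers the most recent conversation turns and replays them when a chat window attaches. The sketch below illustrates that bounded-buffer-and-replay pattern in isolation. It is a minimal sketch under stated assumptions, not the patch's code: `Record`, `Snapshot`, and the use of plain strings for turns are hypothetical stand-ins, and only the cap of 8 is taken from `MaxBufferedConversationTurns`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; the patch stores
// VoiceConversationTurnEventArgs instances, not strings.
const int maxBufferedTurns = 8; // same cap as MaxBufferedConversationTurns

var gate = new object();
var buffered = new List<string>();

// Record a turn and evict the oldest entry once the cap is exceeded.
void Record(string turn)
{
    lock (gate)
    {
        buffered.Add(turn);
        if (buffered.Count > maxBufferedTurns)
        {
            buffered.RemoveAt(0);
        }
    }
}

// Copy the buffer under the lock so the caller can replay it
// without holding the lock while a window processes each turn.
List<string> Snapshot()
{
    lock (gate)
    {
        return new List<string>(buffered);
    }
}

// Ten turns arrive before any window is attached.
for (var i = 1; i <= 10; i++)
{
    Record($"turn-{i}");
}

// A window attaching now would be handed only the newest eight turns.
var replay = Snapshot();
Console.WriteLine(string.Join(", ", replay));
```

Taking the snapshot inside the lock and replaying outside it matches the shape of the `AttachWindow` hunk, where the copy is made under `_gate` and `AppendVoiceConversationTurnAsync` is called afterwards.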
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 82c48ce..a97d024 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -51,8 +51,9 @@ The contracts and persisted settings now use `VoiceWake` and `TalkMode` as well.
`TalkMode` follows the current talk-mode style control flow:
- the node captures audio locally
-- local or remote speech recognition turns that audio into transcript text
+- local speech recognition turns that audio into transcript text on the active STT route
- interim hypotheses are surfaced live, but only final `Medium` or `High` confidence recognizer results are submitted
+- if speech activity ends without any usable final transcript surviving, Talk Mode now clears the draft and gives a short local repeat prompt instead of silently doing nothing
- the tray chat window, when open, mirrors the live transcript draft locally
- the finalized transcript is always sent to OpenClaw via direct `chat.send` on the main session
- OpenClaw returns the assistant reply as normal chat output
@@ -227,13 +228,29 @@ The built-in default for both is `windows`.
Runtime behavior in the current phase:
- `windows` is implemented for both STT and TTS
+- the `windows` STT route is a pure `Windows.Media.SpeechRecognition.SpeechRecognizer` path with no `AudioGraph` dependency
+- `windows` STT is currently treated as `half-duplex, non-streamed`
+- `http/ws` is now catalogued as a visible "coming soon" STT slot for generic streaming HTTP/WebSocket adapters
- built-in catalog entries exist for both `minimax` and `elevenlabs` TTS
- `minimax` defaults to `speech-2.8-turbo` and `English_MatureBoss` at present
- `minimax` now uses a catalog-driven WebSocket contract for synchronous TTS
- `elevenlabs` defaults to `eleven_multilingual_v2` and voice id `6aDn1KB0hjpdcocrUkmq (Tiffany)` for now
-- non-Windows providers can be selected and persisted now
+- only currently usable providers are selectable in Settings
+- `sherpa-onnx` is visible but greyed out as a coming-soon local embedded route
- unsupported providers fall back to Windows at runtime with a status warning
+### Settings Surface Notes
+
+The Settings panel now shows short inline descriptions for:
+
+- the selected voice mode
+- the selected speech-to-text provider
+- the selected text-to-speech provider
+
+Those provider descriptions are drawn directly from the provider catalog.
+
+When `Windows Speech Recognition` is selected for STT, the Settings panel now forces both audio device pickers back to the system defaults and greys them out. That matches the current Windows route limitation and avoids advertising per-device microphone routing that does not exist on this route yet.
+
### Provider Catalog
The provider catalog now ships with the tray app as a bundled asset:
@@ -250,7 +267,16 @@ Example:
"name": "Windows Speech Recognition",
"runtime": "windows",
"enabled": true,
- "description": "Built-in Windows dictation and speech recognition."
+ "description": "Built-in Windows.Media speech recognition, half-duplex, non-streamed."
+ },
+ {
+ "id": "http-ws",
+ "name": "http/ws",
+ "runtime": "streaming",
+ "enabled": false,
+ "visibleInSettings": true,
+ "selectable": false,
+ "description": "Will support most cloud and local stand-alone models, full- or half-duplex, streaming."
},
],
"textToSpeechProviders": [
@@ -819,7 +845,7 @@ Status values used below:
| Voice Wake forwarding to the active gateway / agent | `NotSupported (planned)` | Forwarding semantics are only implemented for Talk Mode today. |
| Voice Wake machine-hint transcript prefixing | `NotSupported (planned)` | Windows does not currently prepend a machine hint on forwarded wake transcripts. |
| Voice Wake mic picker, live level meter, trigger-word table, and tester | `NotSupported (planned)` | Windows has general voice settings and device lists, but not the Voice Wake-specific settings surface from macOS. |
-| Voice mic device selection | `Partial` | Selected output device is implemented; selected microphone binding exists in `AudioGraph`, but actual transcript generation still follows the Windows speech-input path. |
+| Voice mic device selection | `Partial` | When `Windows Speech Recognition` is selected, Settings now locks both audio device pickers to the system defaults. Explicit per-device transcription routing remains a future AudioGraph/streaming-route feature. |
| Voice Wake send / trigger chimes | `NotSupported (planned)` | Windows currently has no configurable trigger/send sounds. |
## Feature List (Backlog)
@@ -834,6 +860,16 @@ Notes:
- final transcript generation still follows the Windows speech-input path rather than the selected device id
- the implementation should complete the planned `AudioGraph` -> `ISpeechToTextAdapter` migration so the chosen microphone controls the whole input pipeline
+### Story: Generic http/ws streaming STT provider
+
+Implement the catalog-driven generic HTTP/WebSocket STT adapter shown in Settings as `http/ws (coming soon)`.
+
+Notes:
+
+- this route is intended to support a broad class of stand-alone cloud or local streaming models
+- it should remain visible but disabled in Settings until the adapter contract is real and testable
+- it should become the shared base path for provider-specific adapters when a generic contract is sufficient
+
### Story: Talk Mode overlay and visible phase parity
Add a Talk Mode overlay that makes `Listening`, `Thinking`, and `Speaking` visible to the user in the same way the macOS experience does.
@@ -1120,3 +1156,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-26` Merged the latest `feature/voice-mode` updates and added explicit recognizer completion decision logging so Talk Mode now records why a stopped Windows speech session did or did not restart/rebuild after idle or watchdog completions.
- `2026-03-26` Cleared the controlled-restart latch as soon as a new recognition session successfully starts, and on failed resume attempts, so Talk Mode does not get stranded in `controlled-restart-in-progress` after an idle/deaf recognizer recycle.
- `2026-03-26` Split Talk Mode STT startup into explicit route classes: the built-in `windows` provider now uses a pure `Windows.Media` path with no AudioGraph, while `foundry-local` and `sherpa-onnx` are separated into dedicated future AudioGraph/embedded route classes instead of sharing the Windows pipeline.
+- `2026-03-26` Updated Talk Mode to mirror sent voice turns back into the tray chat window, clear stale low-confidence drafts, locally reprompt when speech activity ends without a usable transcript, surface provider and mode descriptions in Settings, expose `http/ws` and `sherpa-onnx` as coming-soon STT catalog entries, and lock device selection to system defaults when Windows Speech Recognition is selected.
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index 7281ff4..6ba5202 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -154,6 +154,7 @@ public sealed class VoiceSettingsUpdateArgs
public static class VoiceProviderIds
{
public const string Windows = "windows";
+ public const string HttpWs = "http-ws";
public const string FoundryLocal = "foundry-local";
public const string OpenAiWhisper = "openai-whisper";
public const string ElevenLabsSpeechToText = "elevenlabs-stt";
diff --git a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
index 589fd06..3ffcc0b 100644
--- a/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
+++ b/src/OpenClaw.Tray.WinUI/Assets/voice-providers.json
@@ -5,7 +5,16 @@
"name": "Windows Speech Recognition",
"runtime": "windows",
"enabled": true,
- "description": "Built-in Windows.Media speech recognition without AudioGraph."
+ "description": "Built-in Windows.Media speech recognition, half-duplex, non-streamed."
+ },
+ {
+ "id": "http-ws",
+ "name": "http/ws",
+ "runtime": "streaming",
+ "enabled": false,
+ "visibleInSettings": true,
+ "selectable": false,
+ "description": "Will support most cloud and local stand-alone models, full- or half-duplex, streaming."
},
{
"id": "foundry-local",
@@ -113,7 +122,7 @@
"enabled": false,
"visibleInSettings": true,
"selectable": false,
- "description": "Local embedded STT route for user-supplied sherpa-onnx model bundles.",
+ "description": "Can load a variety of models including OpenAI/Whisper, full-duplex, streaming.",
"settings": [
{
"key": "modelPath",
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
index 5136e21..fa6721f 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
@@ -14,20 +14,56 @@
<TextBlock Text="VOICE" Style="{StaticResource CaptionTextBlockStyle}"
Foreground="#E74C3C" FontWeight="Bold"/>
- <ComboBox x:Name="VoiceModeComboBox" Header="Mode" SelectionChanged="OnVoiceModeChanged">
- <ComboBoxItem Content="Off" Tag="Off"/>
- <ComboBoxItem Content="Voice Wake" Tag="VoiceWake" IsEnabled="False"/>
- <ComboBoxItem Content="Talk Mode" Tag="TalkMode"/>
- </ComboBox>
+ <Grid ColumnSpacing="12">
+ <Grid.ColumnDefinitions>
+ <ColumnDefinition Width="2*"/>
+ <ColumnDefinition Width="3*"/>
+ </Grid.ColumnDefinitions>
+ <ComboBox x:Name="VoiceModeComboBox" Header="Mode" SelectionChanged="OnVoiceModeChanged">
+ <ComboBoxItem Content="Off" Tag="Off"/>
+ <ComboBoxItem Content="Voice Wake" Tag="VoiceWake" IsEnabled="False"/>
+ <ComboBoxItem Content="Talk Mode" Tag="TalkMode"/>
+ </ComboBox>
+ <TextBlock x:Name="VoiceModeDescriptionTextBlock"
+ Grid.Column="1"
+ VerticalAlignment="Center"
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ TextWrapping="Wrap"/>
+ </Grid>
- <ComboBox x:Name="VoiceSpeechToTextProviderComboBox"
- Header="Speech to text provider"
- ItemTemplate="{StaticResource VoiceProviderOptionTemplate}"
- SelectionChanged="OnVoiceProviderChanged"/>
- <ComboBox x:Name="VoiceTextToSpeechProviderComboBox"
- Header="Text to speech provider"
- ItemTemplate="{StaticResource VoiceProviderOptionTemplate}"
- SelectionChanged="OnVoiceProviderChanged"/>
+ <Grid ColumnSpacing="12">
+ <Grid.ColumnDefinitions>
+ <ColumnDefinition Width="2*"/>
+ <ColumnDefinition Width="3*"/>
+ </Grid.ColumnDefinitions>
+ <ComboBox x:Name="VoiceSpeechToTextProviderComboBox"
+ Header="Speech to text provider"
+ ItemTemplate="{StaticResource VoiceProviderOptionTemplate}"
+ SelectionChanged="OnVoiceProviderChanged"/>
+ <TextBlock x:Name="VoiceSpeechToTextProviderDescriptionTextBlock"
+ Grid.Column="1"
+ VerticalAlignment="Center"
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ TextWrapping="Wrap"/>
+ </Grid>
+ <Grid ColumnSpacing="12">
+ <Grid.ColumnDefinitions>
+ <ColumnDefinition Width="2*"/>
+ <ColumnDefinition Width="3*"/>
+ </Grid.ColumnDefinitions>
+ <ComboBox x:Name="VoiceTextToSpeechProviderComboBox"
+ Header="Text to speech provider"
+ ItemTemplate="{StaticResource VoiceProviderOptionTemplate}"
+ SelectionChanged="OnVoiceProviderChanged"/>
+ <TextBlock x:Name="VoiceTextToSpeechProviderDescriptionTextBlock"
+ Grid.Column="1"
+ VerticalAlignment="Center"
+ Style="{StaticResource CaptionTextBlockStyle}"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ TextWrapping="Wrap"/>
+ </Grid>
<StackPanel x:Name="VoiceTtsProviderSettingsPanel" Spacing="8" Visibility="Collapsed">
<TextBlock x:Name="VoiceTtsProviderSettingsTitleTextBlock"
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
index ade8bf2..73c1db1 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -95,6 +95,7 @@ private void LoadVoiceSettings()
_voiceProviderConfigurationDraft = _settings.VoiceProviderConfiguration.Clone();
LoadVoiceProviders();
SelectVoiceMode(_settings.Voice.Mode);
+ UpdateVoiceSelectionDescriptions();
VoiceConversationToastsCheckBox.IsChecked = _settings.Voice.ShowConversationToasts;
VoiceStripInjectedMemoriesCheckBox.IsChecked = _settings.Voice.StripInjectedMemoriesInChat;
UpdateVoiceProviderSettingsEditor();
@@ -124,6 +125,8 @@ private void LoadVoiceProviders()
_ = EnsureSelectableProviderSelection(VoiceSpeechToTextProviderComboBox, _speechToTextOptions, ref _activeSttProviderId);
_ = EnsureSelectableProviderSelection(VoiceTextToSpeechProviderComboBox, _textToSpeechOptions, ref _activeTtsProviderId);
+ UpdateVoiceSelectionDescriptions();
+ UpdateDeviceSelectionAvailability();
}
private async Task LoadVoiceDevicesAsync()
@@ -160,6 +163,7 @@ private async Task LoadVoiceDevicesAsync()
VoiceInputDeviceComboBox.SelectedItem = _inputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.InputDeviceId) ?? _inputOptions[0];
VoiceOutputDeviceComboBox.SelectedItem = _outputOptions.FirstOrDefault(o => o.DeviceId == _settings.Voice.OutputDeviceId) ?? _outputOptions[0];
+ UpdateDeviceSelectionAvailability();
UpdateVoiceSettingsInfo();
}
catch (Exception ex)
@@ -200,6 +204,25 @@ private VoiceActivationMode GetSelectedVoiceMode()
};
}
+ private void UpdateVoiceSelectionDescriptions()
+ {
+ VoiceModeDescriptionTextBlock.Text = GetVoiceModeDescription(GetSelectedVoiceMode());
+ VoiceSpeechToTextProviderDescriptionTextBlock.Text =
+ (VoiceSpeechToTextProviderComboBox.SelectedItem as VoiceProviderOption)?.Description ?? string.Empty;
+ VoiceTextToSpeechProviderDescriptionTextBlock.Text =
+ (VoiceTextToSpeechProviderComboBox.SelectedItem as VoiceProviderOption)?.Description ?? string.Empty;
+ }
+
+ private static string GetVoiceModeDescription(VoiceActivationMode mode)
+ {
+ return mode switch
+ {
+ VoiceActivationMode.TalkMode => "Continuous conversation mode. Listens after replies and sends each completed utterance as a chat turn.",
+ VoiceActivationMode.VoiceWake => "Wake-word mode. Stays idle until the hotword is detected, then starts listening for a request.",
+ _ => "Voice features stay off until you start them manually."
+ };
+ }
+
private void UpdateVoiceSettingsInfo()
{
var stt = (VoiceSpeechToTextProviderComboBox.SelectedItem as VoiceProviderOption)?.Name ?? "Windows Speech Recognition";
@@ -224,6 +247,30 @@ private void UpdateVoiceSettingsInfo()
$"Mode: {VoiceDisplayHelper.GetModeLabel(GetSelectedVoiceMode())}. STT: {stt}. TTS: {tts}. Listen: {input}. Talk: {output}. Chat cleanup: {(VoiceStripInjectedMemoriesCheckBox.IsChecked ?? true ? "on" : "off")}.{fallbackNotice}";
}
+ private void UpdateDeviceSelectionAvailability()
+ {
+ var lockToDefaultDevices = string.Equals(
+ (VoiceSpeechToTextProviderComboBox.SelectedItem as VoiceProviderOption)?.Id,
+ VoiceProviderIds.Windows,
+ StringComparison.OrdinalIgnoreCase);
+
+ if (lockToDefaultDevices)
+ {
+ if (_inputOptions.Count > 0)
+ {
+ VoiceInputDeviceComboBox.SelectedItem = _inputOptions[0];
+ }
+
+ if (_outputOptions.Count > 0)
+ {
+ VoiceOutputDeviceComboBox.SelectedItem = _outputOptions[0];
+ }
+ }
+
+ VoiceInputDeviceComboBox.IsEnabled = !lockToDefaultDevices;
+ VoiceOutputDeviceComboBox.IsEnabled = !lockToDefaultDevices;
+ }
+
private void UpdateVoiceProviderSettingsEditor()
{
var providerId = GetSelectedTextToSpeechProviderId();
@@ -344,6 +391,7 @@ private async void OnRefreshVoiceDevices(object sender, RoutedEventArgs e)
private void OnVoiceModeChanged(object sender, SelectionChangedEventArgs e)
{
+ UpdateVoiceSelectionDescriptions();
UpdateVoiceSettingsInfo();
}
@@ -362,6 +410,8 @@ private void OnVoiceProviderChanged(object sender, SelectionChangedEventArgs e)
}
CaptureSelectedVoiceProviderSettings();
+ UpdateVoiceSelectionDescriptions();
+ UpdateDeviceSelectionAvailability();
UpdateVoiceProviderSettingsEditor();
UpdateVoiceSettingsInfo();
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
index 2450b97..88d5a48 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatContracts.cs
@@ -55,4 +55,5 @@ public interface IVoiceChatWindow
{
bool IsClosed { get; }
Task UpdateVoiceTranscriptDraftAsync(string text, bool clear);
+ Task AppendVoiceConversationTurnAsync(VoiceConversationTurnEventArgs args);
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
index 1ac8b51..9a9b52b 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
@@ -1,17 +1,20 @@
using OpenClaw.Shared;
using System;
+using System.Collections.Generic;
using System.Threading.Tasks;
namespace OpenClawTray.Services.Voice;
public sealed class VoiceChatCoordinator : IDisposable
{
+ private const int MaxBufferedConversationTurns = 8;
private readonly IVoiceRuntime _voiceService;
private readonly IUiDispatcher _dispatcher;
private readonly object _gate = new();
private IVoiceChatWindow? _webChatWindow;
private string _voiceTranscriptDraftText = string.Empty;
+ private readonly List<VoiceConversationTurnEventArgs> _bufferedConversationTurns = [];
private bool _disposed;
public event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
@@ -44,6 +47,17 @@ public void AttachWindow(IVoiceChatWindow window)
_ = window.UpdateVoiceTranscriptDraftAsync(
_voiceTranscriptDraftText,
clear: string.IsNullOrWhiteSpace(_voiceTranscriptDraftText));
+
+ List<VoiceConversationTurnEventArgs> bufferedTurns;
+ lock (_gate)
+ {
+ bufferedTurns = [.. _bufferedConversationTurns];
+ }
+
+ foreach (var turn in bufferedTurns)
+ {
+ _ = window.AppendVoiceConversationTurnAsync(turn);
+ }
}
public void DetachWindow(IVoiceChatWindow? window)
@@ -81,6 +95,23 @@ private void OnVoiceConversationTurnAvailable(object? sender, VoiceConversationT
{
_dispatcher.TryEnqueue(() =>
{
+ IVoiceChatWindow? window;
+ lock (_gate)
+ {
+ _bufferedConversationTurns.Add(CloneTurn(args));
+ if (_bufferedConversationTurns.Count > MaxBufferedConversationTurns)
+ {
+ _bufferedConversationTurns.RemoveAt(0);
+ }
+
+ window = _webChatWindow;
+ }
+
+ if (window != null && !window.IsClosed)
+ {
+ _ = window.AppendVoiceConversationTurnAsync(args);
+ }
+
ConversationTurnAvailable?.Invoke(this, args);
});
}
@@ -105,4 +136,14 @@ private void OnVoiceTranscriptDraftUpdated(object? sender, VoiceTranscriptDraftE
_ = window.UpdateVoiceTranscriptDraftAsync(_voiceTranscriptDraftText, args.Clear);
});
}
+
+ private static VoiceConversationTurnEventArgs CloneTurn(VoiceConversationTurnEventArgs args)
+ {
+ return new VoiceConversationTurnEventArgs
+ {
+ Direction = args.Direction,
+ Message = args.Message,
+ SessionKey = args.SessionKey
+ };
+ }
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 4b4db06..350f005 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -30,14 +30,11 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
private static readonly TimeSpan LateReplyGraceWindow = TimeSpan.FromMinutes(2);
private static readonly TimeSpan InitialRecognitionReadyDelay = TimeSpan.FromMilliseconds(500);
- private static readonly TimeSpan RecognitionSpeechMismatchDelay = TimeSpan.FromSeconds(4);
- private static readonly TimeSpan RecognitionPostSpeechSilenceBeforeRecycle = TimeSpan.FromMilliseconds(750);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
private static readonly TimeSpan HypothesisPromotionWindow = TimeSpan.FromSeconds(2);
private static readonly TimeSpan RecognitionResumeRetryDelay = TimeSpan.FromMilliseconds(500);
private static readonly TimeSpan QueuedReplyPlaybackGap = TimeSpan.FromMilliseconds(500);
- private const int RecognitionSignalBurstThreshold = 4;
- private const float RecognitionSignalPeakThreshold = 0.03f;
+ private const string LowConfidenceRepeatPrompt = "Sorry, I didn't catch that. Could you say it again?";
private readonly IOpenClawLogger _logger;
private readonly SettingsManager _settings;
@@ -59,10 +56,7 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private int _recognitionSessionGeneration;
private bool _recognitionSessionHadActivity;
private bool _recognitionSessionHadCaptureSignal;
- private bool _recognitionHealthCheckArmed;
private bool _recognitionRestartInProgress;
- private int _recognitionSignalBurstCount;
- private DateTime _lastCaptureSignalUtc;
private bool _awaitingReply;
private bool _isSpeaking;
private bool _replyPlaybackLoopActive;
@@ -532,11 +526,6 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
player = new MediaPlayer();
await ConfigurePlaybackOutputDeviceAsync(player, settings);
- if (captureService != null)
- {
- captureService.SignalDetected += OnCaptureSignalDetected;
- }
-
if (recognizer != null)
{
recognizer.HypothesisGenerated += OnSpeechHypothesisGenerated;
@@ -579,7 +568,6 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
{
if (captureService != null)
{
- captureService.SignalDetected -= OnCaptureSignalDetected;
try { await captureService.StopAsync(); } catch { }
try { await captureService.DisposeAsync(); } catch { }
}
@@ -732,9 +720,6 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
runtimeToken = _runtimeCts.Token;
generation = ++_recognitionSessionGeneration;
_recognitionSessionHadCaptureSignal = false;
- _recognitionSignalBurstCount = 0;
- _lastCaptureSignalUtc = default;
- _recognitionHealthCheckArmed = false;
}
_logger.Info("Starting speech recognition session");
@@ -811,8 +796,11 @@ private async Task MonitorListeningReadyAsync(int generation, CancellationToken
if (transitionedToListening)
{
+ var readinessSource = captureService == null
+ ? "recognizer warm-up completed"
+ : "capture frames observed and recognizer warm-up completed";
_logger.Info(
- $"Speech pipeline ready; capture frames observed and recognizer warm-up completed ({InitialRecognitionReadyDelay.TotalMilliseconds:0}ms)");
+ $"Speech pipeline ready; {readinessSource} ({InitialRecognitionReadyDelay.TotalMilliseconds:0}ms)");
}
}
catch (OperationCanceledException)
@@ -899,10 +887,7 @@ private async Task StopRecognitionSessionAsync()
}
_recognitionActive = false;
- _recognitionHealthCheckArmed = false;
_recognitionSessionHadCaptureSignal = false;
- _recognitionSignalBurstCount = 0;
- _lastCaptureSignalUtc = default;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
}
@@ -1016,9 +1001,6 @@ private void OnSpeechHypothesisGenerated(SpeechRecognizer sender, SpeechRecognit
text = args.Hypothesis?.Text?.Trim();
sessionKey = GetCurrentVoiceSessionKey();
_recognitionSessionHadActivity = true;
- _recognitionHealthCheckArmed = false;
- _recognitionSignalBurstCount = 0;
- _lastCaptureSignalUtc = default;
_lastHypothesisText = text;
_lastHypothesisUtc = DateTime.UtcNow;
if (_status.State != VoiceRuntimeState.RecordingUtterance)
@@ -1070,10 +1052,7 @@ private async Task HandleRecognizedTextAsync(string text)
_lastTranscript = text;
_lastTranscriptUtc = DateTime.UtcNow;
_recognitionSessionHadActivity = true;
- _recognitionHealthCheckArmed = false;
_recognitionSessionHadCaptureSignal = false;
- _recognitionSignalBurstCount = 0;
- _lastCaptureSignalUtc = default;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
cancellationToken = _runtimeCts.Token;
@@ -1262,42 +1241,48 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
return;
}
- RaiseConversationTurn(VoiceConversationDirection.Incoming, text, args.SessionKey);
+ QueueAssistantReplyForPlayback(text, args.SessionKey, out shouldStartPlaybackLoop);
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice chat message handler failed: {ex.Message}");
+ }
+ }
- lock (_gate)
- {
- _pendingAssistantReplies.Enqueue((text, args.SessionKey));
- _logger.Info($"Voice reply queued: pending={_pendingAssistantReplies.Count}");
+ private void QueueAssistantReplyForPlayback(string text, string? sessionKey, out bool shouldStartPlaybackLoop)
+ {
+ shouldStartPlaybackLoop = false;
+ RaiseConversationTurn(VoiceConversationDirection.Incoming, text, sessionKey);
- if (!_replyPlaybackLoopActive)
- {
- _replyPlaybackLoopActive = true;
- _isSpeaking = true;
- _status = BuildRunningStatus(
- VoiceActivationMode.TalkMode,
- _status.SessionKey,
- VoiceRuntimeState.PlayingResponse,
- _status.LastError);
- shouldStartPlaybackLoop = true;
- }
- else
- {
- _status = BuildRunningStatus(
- VoiceActivationMode.TalkMode,
- _status.SessionKey,
- VoiceRuntimeState.PlayingResponse,
- _status.LastError);
- }
- }
+ lock (_gate)
+ {
+ _pendingAssistantReplies.Enqueue((text, sessionKey));
+ _logger.Info($"Voice reply queued: pending={_pendingAssistantReplies.Count}");
- if (shouldStartPlaybackLoop)
+ if (!_replyPlaybackLoopActive)
{
- _ = ProcessQueuedAssistantRepliesAsync();
+ _replyPlaybackLoopActive = true;
+ _isSpeaking = true;
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.PlayingResponse,
+ _status.LastError);
+ shouldStartPlaybackLoop = true;
+ }
+ else
+ {
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.PlayingResponse,
+ _status.LastError);
}
}
- catch (Exception ex)
+
+ if (shouldStartPlaybackLoop)
{
- _logger.Warn($"Voice chat message handler failed: {ex.Message}");
+ _ = ProcessQueuedAssistantRepliesAsync();
}
}
@@ -1550,8 +1535,7 @@ internal static bool ShouldRebuildRecognitionAfterCompletion(
return false;
}
- return sessionHadCaptureSignal ||
- status == SpeechRecognitionResultStatus.UserCanceled;
+ return status == SpeechRecognitionResultStatus.UserCanceled;
}
internal static string DescribeRecognitionCompletionRebuildDecision(
@@ -1590,8 +1574,8 @@ internal static string DescribeRecognitionCompletionRebuildDecision(
return status switch
{
SpeechRecognitionResultStatus.UserCanceled => "user-canceled-without-activity",
- SpeechRecognitionResultStatus.TimeoutExceeded => "timeout-without-capture-signal",
- _ => $"status={status}"
+ SpeechRecognitionResultStatus.TimeoutExceeded => "disabled-official-session-restart-only (status=TimeoutExceeded)",
+ _ => $"disabled-official-session-restart-only (status={status})"
};
}
@@ -1644,6 +1628,28 @@ internal static string SelectRecognizedText(
return latestHypothesisText.Trim();
}
+ internal static bool ShouldClearTranscriptDraftAfterCompletion(
+ bool awaitingReply,
+ bool isSpeaking,
+ bool usedFallbackTranscript)
+ {
+ return !awaitingReply &&
+ !isSpeaking &&
+ !usedFallbackTranscript;
+ }
+
+ internal static bool ShouldRepromptAfterIncompleteRecognition(
+ bool sessionHadActivity,
+ bool awaitingReply,
+ bool isSpeaking,
+ bool usedFallbackTranscript)
+ {
+ return sessionHadActivity &&
+ !awaitingReply &&
+ !isSpeaking &&
+ !usedFallbackTranscript;
+ }
+
private static string CreateReplyPreview(string text)
{
var trimmed = text.Trim();
@@ -1738,6 +1744,9 @@ private async void OnSpeechRecognitionCompleted(
var restartDecisionReason = string.Empty;
var rebuildDecisionReason = string.Empty;
string? fallbackText = null;
+ string? sessionKey = null;
+ var shouldClearDraft = false;
+ var shouldReprompt = false;
lock (_gate)
{
@@ -1754,19 +1763,13 @@ private async void OnSpeechRecognitionCompleted(
_lastHypothesisText,
_lastHypothesisUtc,
DateTime.UtcNow);
+ sessionKey = GetCurrentVoiceSessionKey();
_recognitionSessionHadActivity = false;
_recognitionSessionHadCaptureSignal = false;
- _recognitionSignalBurstCount = 0;
- _lastCaptureSignalUtc = default;
restartInProgress = _recognitionRestartInProgress;
if (restartInProgress)
{
_recognitionRestartInProgress = false;
- _recognitionHealthCheckArmed = false;
- }
- else
- {
- _recognitionHealthCheckArmed = false;
}
token = _runtimeCts.Token;
shouldRestart = ShouldRestartRecognitionAfterCompletion(
@@ -1795,6 +1798,15 @@ private async void OnSpeechRecognitionCompleted(
restartInProgress,
_awaitingReply,
_isSpeaking);
+ shouldClearDraft = ShouldClearTranscriptDraftAfterCompletion(
+ _awaitingReply,
+ _isSpeaking,
+ !string.IsNullOrWhiteSpace(fallbackText));
+ shouldReprompt = ShouldRepromptAfterIncompleteRecognition(
+ sessionHadActivity,
+ _awaitingReply,
+ _isSpeaking,
+ !string.IsNullOrWhiteSpace(fallbackText));
}
_logger.Warn(
@@ -1811,6 +1823,18 @@ private async void OnSpeechRecognitionCompleted(
return;
}
+ if (shouldClearDraft)
+ {
+ RaiseTranscriptDraft(string.Empty, sessionKey, clear: true);
+ }
+
+ if (shouldReprompt)
+ {
+ _logger.Warn("Voice recognition session ended after speech activity but without a usable transcript; prompting user to repeat.");
+ QueueAssistantReplyForPlayback(LowConfidenceRepeatPrompt, sessionKey, out _);
+ return;
+ }
+
if (shouldRestart && !token.IsCancellationRequested)
{
await Task.Delay(250, token);
@@ -1897,8 +1921,6 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_recognitionActive = false;
_recognitionSessionHadActivity = false;
_recognitionSessionHadCaptureSignal = false;
- _recognitionSignalBurstCount = 0;
- _lastCaptureSignalUtc = default;
_recognitionRestartInProgress = false;
_lastHypothesisText = null;
_lastHypothesisUtc = default;
@@ -1925,7 +1947,6 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
if (captureService != null)
{
- captureService.SignalDetected -= OnCaptureSignalDetected;
try { await captureService.StopAsync(); } catch { }
try { await captureService.DisposeAsync(); } catch { }
}
@@ -2137,72 +2158,6 @@ private async void OnDefaultAudioCaptureDeviceChanged(object sender, DefaultAudi
}
}
- internal static bool ShouldTreatCaptureSignalAsSpeech(float peakLevel)
- {
- return peakLevel >= RecognitionSignalPeakThreshold;
- }
-
- internal static bool ShouldArmRecognitionRecoveryAfterCaptureSignal(
- bool recognitionSessionHadActivity,
- bool recognitionHealthCheckArmed,
- int recognitionSignalBurstCount)
- {
- return !recognitionSessionHadActivity &&
- !recognitionHealthCheckArmed &&
- recognitionSignalBurstCount >= RecognitionSignalBurstThreshold;
- }
-
- internal static bool ShouldDelayRecognitionRecycleForOngoingSpeech(DateTime lastCaptureSignalUtc, DateTime utcNow)
- {
- return lastCaptureSignalUtc != default &&
- utcNow - lastCaptureSignalUtc < RecognitionPostSpeechSilenceBeforeRecycle;
- }
-
- private void OnCaptureSignalDetected(object? sender, VoiceCaptureSignalEventArgs args)
- {
- var shouldStartRecoveryWatchdog = false;
- var generation = 0;
- CancellationToken cancellationToken = default;
-
- lock (_gate)
- {
- if (_runtimeCts == null ||
- !_status.Running ||
- _status.Mode != VoiceActivationMode.TalkMode ||
- !_recognitionActive ||
- _awaitingReply ||
- _isSpeaking)
- {
- return;
- }
-
- if (!ShouldTreatCaptureSignalAsSpeech(args.PeakLevel))
- {
- return;
- }
-
- _recognitionSessionHadCaptureSignal = true;
- _recognitionSignalBurstCount++;
- _lastCaptureSignalUtc = DateTime.UtcNow;
-
- if (ShouldArmRecognitionRecoveryAfterCaptureSignal(
- _recognitionSessionHadActivity,
- _recognitionHealthCheckArmed,
- _recognitionSignalBurstCount))
- {
- _recognitionHealthCheckArmed = true;
- generation = _recognitionSessionGeneration;
- cancellationToken = _runtimeCts.Token;
- shouldStartRecoveryWatchdog = true;
- }
- }
-
- if (shouldStartRecoveryWatchdog)
- {
- _ = MonitorRecognitionSessionHealthAsync(generation, cancellationToken);
- }
- }
-
private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken cancellationToken)
{
SpeechRecognizer? oldRecognizer;
@@ -2224,9 +2179,6 @@ private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken
_recognitionActive = false;
_recognitionSessionHadActivity = false;
_recognitionSessionHadCaptureSignal = false;
- _recognitionSignalBurstCount = 0;
- _lastCaptureSignalUtc = default;
- _recognitionHealthCheckArmed = false;
}
if (oldRecognizer != null)
@@ -2291,8 +2243,6 @@ private async Task RebuildVoiceCaptureAsync(string reason, CancellationToken can
captureService = _voiceCaptureService;
settings = Clone(_settings.Voice);
_recognitionSessionHadCaptureSignal = false;
- _recognitionSignalBurstCount = 0;
- _lastCaptureSignalUtc = default;
}
if (captureService == null)
@@ -2305,106 +2255,6 @@ private async Task RebuildVoiceCaptureAsync(string reason, CancellationToken can
_logger.Info($"Voice capture graph rebuilt ({reason})");
}
- private async Task MonitorRecognitionSessionHealthAsync(int generation, CancellationToken cancellationToken)
- {
- try
- {
- await Task.Delay(RecognitionSpeechMismatchDelay, cancellationToken);
-
- bool shouldRecycle;
- bool sawCaptureSignal;
- int signalBurstCount;
- DateTime lastCaptureSignalUtc;
-
- while (true)
- {
- shouldRecycle = false;
- sawCaptureSignal = false;
- signalBurstCount = 0;
- lastCaptureSignalUtc = default;
-
- lock (_gate)
- {
- sawCaptureSignal = _recognitionSessionHadCaptureSignal;
- signalBurstCount = _recognitionSignalBurstCount;
- lastCaptureSignalUtc = _lastCaptureSignalUtc;
- shouldRecycle =
- _recognitionHealthCheckArmed &&
- sawCaptureSignal &&
- _recognitionActive &&
- _runtimeCts != null &&
- !_runtimeCts.IsCancellationRequested &&
- _status.Running &&
- _status.Mode == VoiceActivationMode.TalkMode &&
- !_awaitingReply &&
- !_isSpeaking &&
- generation == _recognitionSessionGeneration;
- }
-
- if (!shouldRecycle)
- {
- return;
- }
-
- if (!ShouldDelayRecognitionRecycleForOngoingSpeech(lastCaptureSignalUtc, DateTime.UtcNow))
- {
- break;
- }
-
- var remainingDelay = RecognitionPostSpeechSilenceBeforeRecycle - (DateTime.UtcNow - lastCaptureSignalUtc);
- if (remainingDelay < TimeSpan.FromMilliseconds(50))
- {
- remainingDelay = TimeSpan.FromMilliseconds(50);
- }
-
- await Task.Delay(remainingDelay, cancellationToken);
- }
-
- lock (_gate)
- {
- if (!(_recognitionHealthCheckArmed &&
- _recognitionActive &&
- _runtimeCts != null &&
- !_runtimeCts.IsCancellationRequested &&
- _status.Running &&
- _status.Mode == VoiceActivationMode.TalkMode &&
- !_awaitingReply &&
- !_isSpeaking &&
- generation == _recognitionSessionGeneration))
- {
- return;
- }
-
- _recognitionHealthCheckArmed = false;
- _recognitionRestartInProgress = true;
- _status = BuildRunningStatus(
- VoiceActivationMode.TalkMode,
- _status.SessionKey,
- VoiceRuntimeState.Arming,
- "Speech recognizer stalled; restarting listening.");
- }
-
- _logger.Warn(
- $"Speech recognizer heard sustained capture audio but produced no recognition activity within {RecognitionSpeechMismatchDelay.TotalSeconds:0}s and after post-speech silence; recycling session (captureSignal={sawCaptureSignal}, signalBursts={signalBurstCount})");
- await StopRecognitionSessionAsync();
- await ResumeRecognitionSessionAsync(
- cancellationToken,
- "capture signal without recognition activity",
- rebuildRecognizer: sawCaptureSignal);
- }
- catch (OperationCanceledException)
- {
- }
- catch (Exception ex)
- {
- lock (_gate)
- {
- _recognitionRestartInProgress = false;
- }
- _logger.Warn($"Speech recognition health check failed: {ex.Message}");
- }
- }
-
private VoiceStatusInfo BuildStoppedStatus(string? sessionKey, string? reason)
{
var settings = _settings.Voice;
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs
index caa3ad1..27d9e5c 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs
@@ -8,6 +8,9 @@ namespace OpenClawTray.Services.Voice;
internal sealed class WindowsMediaSpeechToTextRoute : IVoiceSpeechToTextRoute
{
+ private static readonly TimeSpan InitialSilenceTimeout = TimeSpan.FromSeconds(30);
+ private static readonly TimeSpan BabbleTimeout = TimeSpan.FromSeconds(4);
+
private readonly IOpenClawLogger _logger;
public WindowsMediaSpeechToTextRoute(IOpenClawLogger logger)
@@ -34,8 +37,8 @@ public async Task<SpeechRecognizer> CreateRecognizerAsync(VoiceSettings settings
{
var recognizer = new SpeechRecognizer();
recognizer.Timeouts.EndSilenceTimeout = TimeSpan.FromMilliseconds(settings.TalkMode.EndSilenceMs);
- recognizer.Timeouts.InitialSilenceTimeout = TimeSpan.FromSeconds(10);
- recognizer.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(4);
+ recognizer.Timeouts.InitialSilenceTimeout = InitialSilenceTimeout;
+ recognizer.Timeouts.BabbleTimeout = BabbleTimeout;
recognizer.Constraints.Add(new SpeechRecognitionTopicConstraint(SpeechRecognitionScenario.Dictation, "always-on-dictation"));
var compilation = await recognizer.CompileConstraintsAsync();
@@ -45,7 +48,8 @@ public async Task<SpeechRecognizer> CreateRecognizerAsync(VoiceSettings settings
throw new InvalidOperationException($"Speech recognizer unavailable: {compilation.Status}");
}
- _logger.Info($"Speech recognizer compiled successfully ({compilation.Status})");
+ _logger.Info(
+ $"Speech recognizer compiled successfully ({compilation.Status}); endSilenceMs={recognizer.Timeouts.EndSilenceTimeout.TotalMilliseconds:0}; initialSilenceMs={recognizer.Timeouts.InitialSilenceTimeout.TotalMilliseconds:0}; babbleMs={recognizer.Timeouts.BabbleTimeout.TotalMilliseconds:0}");
return recognizer;
}
}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index 2521600..ad863c8 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -5,6 +5,7 @@
using OpenClawTray.Services;
using OpenClawTray.Services.Voice;
using System;
+using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
@@ -22,6 +23,7 @@ public sealed partial class WebChatWindow : WindowEx
private readonly string _token;
private bool _stripInjectedMemories;
private string _pendingVoiceDraft = string.Empty;
+ private readonly List<VoiceConversationTurnMirror> _pendingVoiceTurns = [];
// Store event handlers for cleanup
private TypedEventHandler<CoreWebView2, CoreWebView2NavigationCompletedEventArgs>? _navigationCompletedHandler;
@@ -29,6 +31,8 @@ public sealed partial class WebChatWindow : WindowEx
public bool IsClosed { get; private set; }
+ private sealed record VoiceConversationTurnMirror(string Direction, string Text);
+
private const string TrayVoiceIntegrationScript = """
(() => {
const isVisible = (el) => !!el && !(el.disabled === true) && el.getClientRects().length > 0;
@@ -58,6 +62,67 @@ public sealed partial class WebChatWindow : WindowEx
el.dispatchEvent(new Event('change', { bubbles: true }));
}
};
+ let desiredTurns = [];
+ const ensureTurnsHost = () => {
+ if (!document.body) return null;
+ let host = document.getElementById('openclaw-tray-voice-turns');
+ if (host) return host;
+ host = document.createElement('div');
+ host.id = 'openclaw-tray-voice-turns';
+ Object.assign(host.style, {
+ position: 'fixed',
+ left: '16px',
+ right: '16px',
+ bottom: '88px',
+ zIndex: '2147483000',
+ display: 'flex',
+ flexDirection: 'column',
+ gap: '8px',
+ pointerEvents: 'none',
+ alignItems: 'stretch'
+ });
+ document.body.appendChild(host);
+ return host;
+ };
+ const renderTurns = () => {
+ const host = ensureTurnsHost();
+ if (!host) return false;
+ host.innerHTML = '';
+ const items = Array.isArray(desiredTurns) ? desiredTurns : [];
+ if (items.length === 0) {
+ host.style.display = 'none';
+ return true;
+ }
+ host.style.display = 'flex';
+ for (const item of items) {
+ if (!item || !item.text) continue;
+ const row = document.createElement('div');
+ Object.assign(row.style, {
+ display: 'flex',
+ justifyContent: item.direction === 'incoming' ? 'flex-start' : 'flex-end'
+ });
+ const bubble = document.createElement('div');
+ bubble.textContent = item.text;
+ Object.assign(bubble.style, {
+ maxWidth: 'min(70vw, 720px)',
+ padding: '10px 14px',
+ borderRadius: '16px',
+ boxShadow: '0 8px 20px rgba(15, 23, 42, 0.12)',
+ border: item.direction === 'incoming'
+ ? '1px solid rgba(148, 163, 184, 0.35)'
+ : '1px solid rgba(59, 130, 246, 0.35)',
+ background: item.direction === 'incoming'
+ ? 'rgba(255, 255, 255, 0.94)'
+ : 'rgba(219, 234, 254, 0.96)',
+ color: '#0f172a',
+ font: '500 14px/1.4 \"Segoe UI\", sans-serif',
+ whiteSpace: 'pre-wrap'
+ });
+ row.appendChild(bubble);
+ host.appendChild(row);
+ }
+ return true;
+ };
const applyDraftIfPossible = () => {
const composer = findComposer();
if (!composer) return false;
@@ -95,6 +160,7 @@ public sealed partial class WebChatWindow : WindowEx
refreshScheduled = false;
cleanTextNodes();
applyDraftIfPossible();
+ renderTurns();
});
};
const observer = new MutationObserver(() => refreshView());
@@ -118,6 +184,10 @@ public sealed partial class WebChatWindow : WindowEx
refreshView();
return true;
},
+ setTurns(turns) {
+ desiredTurns = Array.isArray(turns) ? turns : [];
+ return renderTurns();
+ },
clearDraft() {
desiredDraft = '';
return applyDraftIfPossible();
@@ -378,6 +448,25 @@ public async Task UpdateVoiceTranscriptDraftAsync(string text, bool clear)
await RefreshTrayVoiceDomStateAsync();
}
+ public async Task AppendVoiceConversationTurnAsync(VoiceConversationTurnEventArgs args)
+ {
+ ArgumentNullException.ThrowIfNull(args);
+
+ if (args.Direction != VoiceConversationDirection.Outgoing ||
+ string.IsNullOrWhiteSpace(args.Message))
+ {
+ return;
+ }
+
+ _pendingVoiceTurns.Add(new VoiceConversationTurnMirror("outgoing", args.Message.Trim()));
+ if (_pendingVoiceTurns.Count > 6)
+ {
+ _pendingVoiceTurns.RemoveAt(0);
+ }
+
+ await RefreshTrayVoiceDomStateAsync();
+ }
+
public async Task SetStripInjectedMemoriesEnabledAsync(bool enabled)
{
_stripInjectedMemories = enabled;
@@ -403,6 +492,10 @@ await WebView.CoreWebView2.ExecuteScriptAsync(
: $"window.__openClawTrayVoice?.setDraft?.({draftJson});";
await WebView.CoreWebView2.ExecuteScriptAsync(script);
+
+ var turnsJson = JsonSerializer.Serialize(_pendingVoiceTurns, new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase });
+ await WebView.CoreWebView2.ExecuteScriptAsync(
+ $"window.__openClawTrayVoice?.setTurns?.({turnsJson});");
}
catch (Exception ex)
{
diff --git a/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
index 484a9bc..0840aec 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
@@ -91,6 +91,50 @@ public void ConversationTurn_IsForwarded()
Assert.Equal(VoiceConversationDirection.Incoming, received.Direction);
}
+ [Fact]
+ public async Task ConversationTurn_IsMirroredToAttachedWindow()
+ {
+ var runtime = new FakeVoiceRuntime();
+ using var coordinator = new VoiceChatCoordinator(runtime, new ImmediateDispatcher());
+ var window = new FakeVoiceChatWindow();
+ coordinator.AttachWindow(window);
+
+ runtime.RaiseConversationTurn(new VoiceConversationTurnEventArgs
+ {
+ Direction = VoiceConversationDirection.Outgoing,
+ Message = "hello from voice",
+ SessionKey = "main"
+ });
+ await Task.Yield();
+
+ Assert.Equal("hello from voice", window.LastTurnMessage);
+ Assert.Equal(VoiceConversationDirection.Outgoing, window.LastTurnDirection);
+ Assert.Equal(1, window.TurnCallCount);
+ }
+
+ [Fact]
+ public async Task AttachWindow_ReplaysBufferedConversationTurns()
+ {
+ var runtime = new FakeVoiceRuntime();
+ using var coordinator = new VoiceChatCoordinator(runtime, new ImmediateDispatcher());
+
+ runtime.RaiseConversationTurn(new VoiceConversationTurnEventArgs
+ {
+ Direction = VoiceConversationDirection.Outgoing,
+ Message = "replay this",
+ SessionKey = "main"
+ });
+ await Task.Yield();
+
+ var window = new FakeVoiceChatWindow();
+ coordinator.AttachWindow(window);
+ await Task.Yield();
+
+ Assert.Equal("replay this", window.LastTurnMessage);
+ Assert.Equal(VoiceConversationDirection.Outgoing, window.LastTurnDirection);
+ Assert.Equal(1, window.TurnCallCount);
+ }
+
private sealed class ImmediateDispatcher : IUiDispatcher
{
public bool TryEnqueue(Action callback)
@@ -128,6 +172,9 @@ private sealed class FakeVoiceChatWindow : IVoiceChatWindow
public string LastDraftText { get; private set; } = string.Empty;
public bool LastDraftClear { get; private set; }
public int UpdateCallCount { get; private set; }
+ public string LastTurnMessage { get; private set; } = string.Empty;
+ public VoiceConversationDirection? LastTurnDirection { get; private set; }
+ public int TurnCallCount { get; private set; }
public Task UpdateVoiceTranscriptDraftAsync(string text, bool clear)
{
@@ -136,5 +183,13 @@ public Task UpdateVoiceTranscriptDraftAsync(string text, bool clear)
LastDraftClear = clear;
return Task.CompletedTask;
}
+
+ public Task AppendVoiceConversationTurnAsync(VoiceConversationTurnEventArgs args)
+ {
+ TurnCallCount++;
+ LastTurnMessage = args.Message ?? string.Empty;
+ LastTurnDirection = args.Direction;
+ return Task.CompletedTask;
+ }
}
}
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index 0bf1451..a1fa893 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -181,12 +181,12 @@ public void DescribeRecognitionCompletionRestartDecision_ExplainsWhyRestartIsBlo
[InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, false, false, true)]
[InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, false, false)]
[InlineData(SpeechRecognitionResultStatus.Success, false, false, false, false, false, false)]
- [InlineData(SpeechRecognitionResultStatus.Success, false, true, false, false, false, true)]
+ [InlineData(SpeechRecognitionResultStatus.Success, false, true, false, false, false, false)]
[InlineData(SpeechRecognitionResultStatus.UserCanceled, true, false, false, false, false, false)]
[InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, true, false, false, false)]
[InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, true, false, false)]
[InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, false, true, false)]
- public void ShouldRebuildRecognitionAfterCompletion_OnlyRebuildsForDeafCanceledSessions(
+ public void ShouldRebuildRecognitionAfterCompletion_RebuildsOnlyForUserCanceledWithoutActivity(
SpeechRecognitionResultStatus status,
bool sessionHadActivity,
bool sessionHadCaptureSignal,
@@ -209,8 +209,8 @@ public void ShouldRebuildRecognitionAfterCompletion_OnlyRebuildsForDeafCanceledS
[Theory]
[InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, true, false, false, false, "capture-signal-without-recognition")]
[InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false, false, false, false, "user-canceled-without-activity")]
- [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, false, "timeout-without-capture-signal")]
- [InlineData(SpeechRecognitionResultStatus.Success, false, false, false, false, false, "status=Success")]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false, false, false, false, "disabled-official-session-restart-only (status=TimeoutExceeded)")]
+ [InlineData(SpeechRecognitionResultStatus.Success, false, false, false, false, false, "disabled-official-session-restart-only (status=Success)")]
[InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, true, true, false, false, false, "session-had-activity")]
[InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, true, true, false, false, "controlled-restart-in-progress")]
[InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, true, false, true, false, "awaiting-reply")]
@@ -235,62 +235,6 @@ public void DescribeRecognitionCompletionRebuildDecision_ExplainsWhyRebuildIsBlo
Assert.Equal(expected, result);
}
- [Theory]
- [InlineData(0.029f, false)]
- [InlineData(0.03f, true)]
- [InlineData(0.08f, true)]
- public void ShouldTreatCaptureSignalAsSpeech_RequiresSpeechLikePeak(float peakLevel, bool expected)
- {
- var method = typeof(VoiceService).GetMethod(
- "ShouldTreatCaptureSignalAsSpeech",
- BindingFlags.NonPublic | BindingFlags.Static)!;
-
- var result = (bool)method.Invoke(null, [peakLevel])!;
-
- Assert.Equal(expected, result);
- }
-
- [Theory]
- [InlineData(false, false, 1, false)]
- [InlineData(false, false, 3, false)]
- [InlineData(false, false, 4, true)]
- [InlineData(true, false, 4, false)]
- [InlineData(false, true, 4, false)]
- public void ShouldArmRecognitionRecoveryAfterCaptureSignal_RequiresBurstWithoutRecognitionActivity(
- bool recognitionSessionHadActivity,
- bool recognitionHealthCheckArmed,
- int recognitionSignalBurstCount,
- bool expected)
- {
- var method = typeof(VoiceService).GetMethod(
- "ShouldArmRecognitionRecoveryAfterCaptureSignal",
- BindingFlags.NonPublic | BindingFlags.Static)!;
-
- var result = (bool)method.Invoke(
- null,
- [recognitionSessionHadActivity, recognitionHealthCheckArmed, recognitionSignalBurstCount])!;
-
- Assert.Equal(expected, result);
- }
-
- [Theory]
- [InlineData(0, true)]
- [InlineData(500, true)]
- [InlineData(749, true)]
- [InlineData(750, false)]
- [InlineData(1200, false)]
- public void ShouldDelayRecognitionRecycleForOngoingSpeech_RequiresShortRecentSignal(int elapsedMs, bool expected)
- {
- var method = typeof(VoiceService).GetMethod(
- "ShouldDelayRecognitionRecycleForOngoingSpeech",
- BindingFlags.NonPublic | BindingFlags.Static)!;
- var now = new DateTime(2026, 3, 25, 21, 36, 35, DateTimeKind.Utc);
-
- var result = (bool)method.Invoke(null, [now.AddMilliseconds(-elapsedMs), now])!;
-
- Assert.Equal(expected, result);
- }
-
[Theory]
[InlineData(16000, 80, 1280)]
[InlineData(16000, 0, 1280)]
@@ -377,6 +321,52 @@ public void SelectCompletionFallbackText_PromotesRecentHypothesisWhenSessionHadA
Assert.Equal(expected, result);
}
+ [Theory]
+ [InlineData(false, false, false, true)]
+ [InlineData(true, false, false, false)]
+ [InlineData(false, true, false, false)]
+ [InlineData(false, false, true, false)]
+ public void ShouldClearTranscriptDraftAfterCompletion_ClearsOnlyWhenNoReplyOrFallbackInFlight(
+ bool awaitingReply,
+ bool isSpeaking,
+ bool usedFallbackTranscript,
+ bool expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "ShouldClearTranscriptDraftAfterCompletion",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (bool)method.Invoke(
+ null,
+ [awaitingReply, isSpeaking, usedFallbackTranscript])!;
+
+ Assert.Equal(expected, result);
+ }
+
+ [Theory]
+ [InlineData(true, false, false, false, true)]
+ [InlineData(false, false, false, false, false)]
+ [InlineData(true, true, false, false, false)]
+ [InlineData(true, false, true, false, false)]
+ [InlineData(true, false, false, true, false)]
+ public void ShouldRepromptAfterIncompleteRecognition_OnlyPromptsWhenSpeechWasHeardButNothingUsableSurvived(
+ bool sessionHadActivity,
+ bool awaitingReply,
+ bool isSpeaking,
+ bool usedFallbackTranscript,
+ bool expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "ShouldRepromptAfterIncompleteRecognition",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (bool)method.Invoke(
+ null,
+ [sessionHadActivity, awaitingReply, isSpeaking, usedFallbackTranscript])!;
+
+ Assert.Equal(expected, result);
+ }
+
[Theory]
[InlineData(true, VoiceActivationMode.TalkMode, null, AudioDeviceRole.Default, true)]
[InlineData(true, VoiceActivationMode.TalkMode, "", AudioDeviceRole.Default, true)]
From c6be9007c00fec0b044841b261a775c3f0b90e3b Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Thu, 26 Mar 2026 23:33:06 +0000
Subject: [PATCH 77/83] Ship v0.1rc3 voice chat and docs fixes
---
docs/VOICE-MODE.md | 215 ++++++++----------
.../Capabilities/VoiceCapability.cs | 4 +-
src/OpenClaw.Shared/VoiceModeSchema.cs | 2 +-
.../Properties/AssemblyInfo.cs | 3 +
.../Services/Voice/VoiceService.cs | 72 ++++--
.../Voice/WindowsMediaSpeechToTextRoute.cs | 2 +-
.../Windows/WebChatWindow.xaml.cs | 108 ++++++---
.../OpenClaw.Shared.Tests/CapabilityTests.cs | 29 +++
.../VoiceModeSchemaTests.cs | 2 +-
.../VoiceServiceTransportTests.cs | 20 ++
.../WebChatWindowDomBridgeTests.cs | 36 +++
11 files changed, 315 insertions(+), 178 deletions(-)
create mode 100644 src/OpenClaw.Tray.WinUI/Properties/AssemblyInfo.cs
create mode 100644 tests/OpenClaw.Tray.Tests/WebChatWindowDomBridgeTests.cs
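
Reviewer note (commentary only, not part of the patch): the docs hunk below splits the voice command surface into a configuration API and a runtime control API, and documents the skip command as `voice.response.skip` instead of `voice.skip`. A minimal sketch of that grouping follows. `VoiceCommandGroups`, `VoiceCommandGroup`, and `Classify` are hypothetical names used for illustration and are assumed not to exist in the repository; only the command strings are taken from the docs hunk.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of OpenClaw.Shared.
public enum VoiceCommandGroup { Configuration, RuntimeControl, Unknown }

public static class VoiceCommandGroups
{
    // Commands the docs describe as the configuration API.
    private static readonly HashSet<string> ConfigurationCommands = new(StringComparer.Ordinal)
    {
        "voice.devices.list",
        "voice.settings.get",
        "voice.settings.set",
    };

    // Commands the docs describe as the runtime control API.
    private static readonly HashSet<string> RuntimeControlCommands = new(StringComparer.Ordinal)
    {
        "voice.start",
        "voice.stop",
        "voice.pause",
        "voice.resume",
        "voice.response.skip", // documented as voice.skip before this patch
    };

    // Returns the documented group for a command name, or Unknown.
    public static VoiceCommandGroup Classify(string command)
    {
        if (ConfigurationCommands.Contains(command)) return VoiceCommandGroup.Configuration;
        if (RuntimeControlCommands.Contains(command)) return VoiceCommandGroup.RuntimeControl;
        return VoiceCommandGroup.Unknown;
    }
}
```

Under these assumptions, `VoiceCommandGroups.Classify("voice.skip")` returns `Unknown`, so a caller that still sends the old name can be spotted with a lookup of this kind.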
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index a97d024..649f6d2 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -8,7 +8,7 @@ This document defines the voice subsystem for the Windows node only. It introduc
- Utilise minimal touch points to the existing app to reduce the potential for screw-ups.
- Use NanoWakeWord for wakeword detection on-device
- Present the user-facing mode names as `Voice Wake` and `Talk Mode`
-- Keep STT/TTS provider selection configurable, with Windows implementations as the default built-ins
+- Keep STT/TTS provider selection configurable, with Windows implementations as the default built-ins (available to all)
- Implement `MiniMax` TTS and `ElevenLabs` TTS as required non-Windows providers after the Windows baseline
- Make adding new voice providers an update to a Json catalog, rather than requiring code changes
- Reuse the existing node capability pattern instead of introducing a parallel control path
@@ -17,7 +17,7 @@ This document defines the voice subsystem for the Windows node only. It introduc
## Non-Goals
-- True full-duplex or chunk-streaming audio transport between node and gateway
+- True full-duplex, chunk-streaming audio transport between node and gateway
- Subtantial changes to the existing project
## Design Position
@@ -41,7 +41,7 @@ The tray app now uses user-facing names (borrowed from the macOS app) rather tha
| Internal Mode | Visible Name | Availability |
|---|---|---|
| `Off` | Off | available |
-| `VoiceWake` | Voice Wake | visible but disabled for now |
+| `VoiceWake` | Voice Wake | visible but disabled for now (coming soon) |
| `TalkMode` | Talk Mode | available |
The contracts and persisted settings now use `VoiceWake` and `TalkMode` as well.
@@ -87,13 +87,13 @@ The node capability command surface is:
- `voice.stop`
- `voice.pause`
- `voice.resume`
-- `voice.skip`
+- `voice.response.skip`
These commands are defined in [VoiceModeSchema.cs](../src/OpenClaw.Shared/VoiceModeSchema.cs) and handled by [VoiceCapability.cs](../src/OpenClaw.Shared/Capabilities/VoiceCapability.cs).
-`voice.settings.get` / `voice.settings.set` are the configuration API.
+`voice.devices.list` / `voice.settings.get` / `voice.settings.set` are the configuration API.
-`voice.start` / `voice.stop` / `voice.pause` / `voice.resume` / `voice.skip` are the runtime control API.
+`voice.start` / `voice.stop` / `voice.pause` / `voice.resume` / `voice.response.skip` are the runtime control API.
### Status Surface
@@ -158,7 +158,7 @@ The main remaining gap is streaming playback from the first audio chunk. The Azu
- MiniMax now uses the provider catalog's WebSocket TTS contract, but the current player still waits for a complete playable stream before output starts
- ElevenLabs now uses the provider catalog's `stream-input` WebSocket contract, but the current player still waits for a complete playable stream before output starts
-So the current design minimizes avoidable setup and connection latency, but does not yet implement first-chunk playback streaming. This is, however, planned for an early release.
+So the current design minimizes avoidable setup and connection latency, but does not yet implement first-chunk playback streaming. **First-chunk playback is planned for an early release.**
## Tray Chat Integration Decision
@@ -218,6 +218,8 @@ It also fits with the planned voice mode *repeater form*, which will act as an o
## Provider Selection
+Voice providers are defined in a JSON catalog rather than being hard-coded; see the Provider Catalog section below.
+
Voice settings now carry explicit provider ids for both STT and TTS:
- `Voice.SpeechToTextProviderId`
@@ -229,7 +231,7 @@ Runtime behavior in the current phase:
- `windows` is implemented for both STT and TTS
- the `windows` STT route is a pure `Windows.Media.SpeechRecognition.SpeechRecognizer` path with no `AudioGraph` dependency
-- `windows` STT is currently treated as `half-duplex, non-streamed`
+- `windows` STT is `half-duplex, non-streamed`
- `http/ws` is now catalogued as a visible "coming soon" STT slot for generic streaming HTTP/WebSocket adapters
- built-in catalog entries exist for both `minimax` and `elevenlabs` TTS
- `minimax` defaults to `speech-2.8-turbo` and `English_MatureBoss` at present
@@ -249,7 +251,7 @@ The Settings panel now shows short inline descriptions for:
Those provider descriptions are drawn directly from the provider catalog.
-When `Windows Speech Recognition` is selected for STT, the Settings panel now forces both audio device pickers back to the system defaults and greys them out. That matches the current Windows route limitation and avoids advertising per-device microphone routing that does not exist on this route yet.
+When `Windows Speech Recognition` is selected for STT, the Settings panel now forces both audio device pickers back to the system defaults and greys them out. That matches the current Windows route limitation.
### Provider Catalog
@@ -446,7 +448,7 @@ The voice subsystem is introduced as a new node capability category: `voice`.
| `voice.stop` | Stop the voice runtime | `VoiceStopArgs` | `VoiceStatusInfo` |
| `voice.pause` | Pause the active voice runtime | `VoicePauseArgs` | `VoiceStatusInfo` |
| `voice.resume` | Resume a paused voice runtime | `VoiceResumeArgs` | `VoiceStatusInfo` |
-| `voice.skip` | Skip the currently spoken reply and advance the queue if another reply is pending | `VoiceSkipArgs` | `VoiceStatusInfo` |
+| `voice.response.skip` | Skip the currently spoken reply and advance the queue if another reply is pending | `VoiceSkipArgs` | `VoiceStatusInfo` |
### Payload Types
@@ -572,51 +574,50 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
At runtime today:
- `Voice.OutputDeviceId` is applied to Talk Mode playback through `MediaPlayer.AudioDevice`
-- `VoiceCaptureService` now runs an `AudioGraph` capture pipeline in parallel with Talk Mode and binds it to the selected or default microphone device
-- `Voice.InputDeviceId` is now used by that `AudioGraph` capture path, but transcript generation still uses the Windows default speech input path until the STT adapter migration is complete
-- Talk Mode only advertises `ListeningContinuously` after the capture graph has produced live frames and the recognizer warm-up window has elapsed, so the status acts as a real “you can start talking now” signal instead of a timer-only guess
-- recognizer recovery is now speech-triggered rather than silence-triggered: the Windows recognizer is only recycled when sustained capture speech is present but no recognition activity follows
-- when a recognizer session ends after real hypothesis activity but before a final result arrives, Talk Mode now promotes the last recent hypothesis and submits it instead of dropping the utterance
-- the speech-mismatch recovery watchdog is single-owner and only armed from capture speech, so a new recognition session does not spawn overlapping recovery loops
-- when the system default capture device changes and Talk Mode is using the default mic, the recognizer is rebuilt so device switches such as AirPods are picked up without a full app restart
-- explicit non-default microphone transcript generation is still pending the planned STT adapter migration
+- the selectable `windows` STT route is a pure `Windows.Media.SpeechRecognition.SpeechRecognizer` path with no `AudioGraph` running in parallel
+- when `windows` STT is selected, Windows owns speech input on the system default speech device, so Settings forces both device pickers back to system defaults and greys them out
+- Talk Mode only advertises `ListeningContinuously` after the current recognizer session starts and the recognizer warm-up window elapses, so the status acts as a real “you can start talking now” signal instead of a timer-only guess
+- if a recognition session ends with speech activity but no usable transcript, Talk Mode clears the draft and gives a short local repeat prompt instead of silently doing nothing
+- user voice turns are mirrored locally into the tray chat window through the WebView DOM bridge, separate from the upstream `chat.send` transport path
+- streaming `AudioGraph`-based STT routes remain scaffolded in code but are not implemented or user-selectable yet
## Current Runtime Architecture
-The current Windows implementation is still centred on `VoiceService`, with a few supporting seams around it:
+The current Windows implementation is still centred on `VoiceService`, with a few supporting components around it:
- `VoiceCapability`
exposes shared `voice.*` commands to the node/gateway surface
-- `VoiceCaptureService`
- owns the new `AudioGraph` capture backbone, selected/default microphone binding, and live signal detection
- `VoiceService`
- owns Talk Mode runtime state, recognizer/TTS integration, reply queuing, timeouts, gateway reply handling, and the transition layer between `AudioGraph` capture and the current recognizer-owned STT path
+ owns Talk Mode runtime state, Windows speech recognizer/TTS integration, reply queuing, timeouts, gateway reply handling, repeat prompts, and recognition-session restart/rebuild decisions
- `VoiceChatCoordinator`
mirrors interim transcript drafts and conversation turns into the tray UI without making the chat window part of the transport path
- `OpenClawGatewayClient`
carries direct `chat.send`, final chat events, and the `sessions.preview` fallback path for bare final markers
- `WebChatWindow`
- mirrors live transcript drafts locally and optionally strips injected `<relevant-memories>` blocks from rendered chat text
+ mirrors live transcript drafts and outgoing voice turns locally and optionally strips injected `<relevant-memories>` blocks from rendered chat text
+- `VoiceCaptureService`
+ remains in the codebase as the planned `AudioGraph` capture backbone for future streaming/selected-device STT routes, but is not part of the active `windows` STT runtime path
### Current End-to-End Talk Mode
```mermaid
flowchart LR
- A["User speech"] --> B["VoiceCaptureService<br/>AudioGraph on selected/default mic"]
- A --> C["Windows SpeechRecognizer<br/>continuous dictation on current default mic"]
-
- B --> D["FrameCaptured / SignalDetected"]
- D --> E["VoiceService<br/>capture-backed health + device state"]
-
- C --> F["HypothesisGenerated<br/>interim text"]
- F --> G["VoiceService<br/>draft event"]
- G --> H["VoiceChatCoordinator"]
- H --> I["WebChatWindow<br/>local compose-box mirror only"]
-
- C --> J["ResultGenerated<br/>final Medium/High text"]
- J --> K["VoiceService<br/>duplicate guard + late hypothesis promotion"]
- K --> L["Stop recognition session"]
- L --> M["OpenClawGatewayClient.SendChatMessageAsync<br/>direct chat.send(main, transcript)"]
+ A["User speech"] --> B["Windows SpeechRecognizer<br/>continuous dictation on default speech input"]
+ B --> C["HypothesisGenerated<br/>interim text"]
+ C --> D["VoiceService<br/>draft event"]
+ D --> E["VoiceChatCoordinator"]
+ E --> F["WebChatWindow<br/>local compose-box mirror"]
+
+ B --> G["ResultGenerated<br/>final text + confidence"]
+ G --> H{"confidence Medium/High?"}
+ H -- "yes" --> I["VoiceService<br/>duplicate guard + direct submit"]
+ H -- "no" --> J["Ignore as unsendable final"]
+
+ I --> K["Raise outgoing conversation turn"]
+ K --> E
+ I --> L["Clear local draft"]
+ L --> E
+ I --> M["OpenClawGatewayClient.SendChatMessageAsync<br/>direct chat.send(main, transcript)"]
M --> N["OpenClaw / session pipeline"]
N --> O["Chat final event"]
O --> P{"assistant text present?"}
@@ -633,21 +634,22 @@ flowchart LR
W --> X["MediaPlayer<br/>selected OutputDeviceId if set"]
X --> Y["Speaker / headset output"]
Y --> Z["Resume recognition when queue drains"]
+ J --> AA["Session completes without usable transcript"]
+ AA --> AB["Clear draft + local repeat prompt"]
```
### Current Processing Stages
| Stage | Component | Input | Output |
|---|---|---|---|
-| 1 | `VoiceCaptureService` | selected/default microphone device | continuous frame and signal events from `AudioGraph` |
-| 2 | `SpeechRecognizer` | Windows default speech-input path | interim/final transcript text |
-| 3 | `VoiceService` | capture signal + final transcript text | health/restart decisions, de-duplicated transcript, runtime state changes |
-| 4 | `VoiceChatCoordinator` | interim/final draft events | mirrored tray chat compose text |
-| 5 | `OpenClawGatewayClient` | transcript text + session key | `chat.send` request + assistant reply events |
-| 6 | `OpenClawGatewayClient` preview fallback | bare final chat marker | assistant preview text, guarded against stale replay |
-| 7 | `VoiceService` reply queue | assistant reply text | ordered reply playback work |
-| 8 | `VoiceCloudTextToSpeechClient` / `SpeechSynthesizer` | assistant reply text | complete playable audio stream |
-| 9 | `MediaPlayer` | complete playable audio stream | rendered audio on default or selected speaker |
+| 1 | `SpeechRecognizer` | Windows default speech-input path | interim/final transcript text |
+| 2 | `VoiceService` | interim/final transcript text | runtime state changes, transcript submission, repeat-prompt decisions |
+| 3 | `VoiceChatCoordinator` | draft/conversation events | mirrored tray chat compose text and mirrored outgoing voice turns |
+| 4 | `OpenClawGatewayClient` | transcript text + session key | `chat.send` request + assistant reply events |
+| 5 | `OpenClawGatewayClient` preview fallback | bare final chat marker | assistant preview text, guarded against stale replay |
+| 6 | `VoiceService` reply queue | assistant reply text | ordered reply playback work |
+| 7 | `VoiceCloudTextToSpeechClient` / `SpeechSynthesizer` | assistant reply text | complete playable audio stream |
+| 8 | `MediaPlayer` | complete playable audio stream | rendered audio on default or selected speaker |
## Planned AudioGraph Input Architecture
@@ -679,7 +681,7 @@ flowchart TD
J --> K["OpenClawGatewayClient<br/>chat.send + reply events"]
```
-### Proposed Seams
+### Planned Component Boundaries
The target split should look like this:
@@ -687,6 +689,7 @@ The target split should look like this:
- owns `AudioGraph`
- binds to an explicit input device id when one is selected
- emits continuous PCM frames
+ - supports streaming/full-duplex STT engines
- `IVoiceActivityDetector`
- emits speech / silence transitions from frame data
- `IUtteranceAssembler`
@@ -714,30 +717,20 @@ Suggested shape:
Likely first adapters:
-- `WindowsSpeechToTextAdapter`
- only if Windows gives us a clean explicit-audio-input path
- `StreamingCloudSpeechToTextAdapter`
- for providers that accept pushed PCM/audio streams
+ - for providers that accept pushed PCM/audio streams
+ - will also support Foundry Local and other self-contained models
- `UtteranceCloudSpeechToTextAdapter`
- for providers that still expect bounded utterance uploads
+ - for providers that still expect bounded utterance uploads
## Selected-Device Roadmap
The current selected-device position is now:
- selected non-default speaker: implemented
-- selected/default microphone binding for `AudioGraph` capture: implemented
+- selected/default microphone binding for future `AudioGraph` capture routes: scaffolded, but not active in the current selectable runtime
- selected non-default microphone for actual transcript generation: not implemented yet
-Recommended engineering order:
-
-1. keep the current selected-speaker playback support
-2. extend the live `VoiceCaptureService` path into the STT side
-3. move Talk Mode input from `SpeechRecognizer` ownership to captured PCM frames
-4. introduce `ISpeechToTextAdapter`
-5. complete explicit selected-microphone transcript generation
-6. then revisit duplex/barge-in and streaming STT
-
## Control Flow
```mermaid
@@ -772,7 +765,7 @@ Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningContinuously)
Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningContinuously)
VoiceCap-->>Gateway: VoiceStatusInfo
- Gateway->>VoiceCap: voice.skip(reason=...)
+ Gateway->>VoiceCap: voice.response.skip(reason=...)
VoiceCap->>Coord: SkipCurrentReply()
Coord-->>VoiceCap: VoiceStatusInfo
VoiceCap-->>Gateway: VoiceStatusInfo
@@ -804,7 +797,7 @@ Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningContinuously)
## Provider Direction
-Provider support is now part of the Windows voice subsystem roadmap, not a hypothetical extension:
+Provider support is now part of the Windows voice subsystem roadmap:
- `MiniMax` and `ElevenLabs` TTS are both expressed through built-in catalog contracts
- additional HTTP or WebSocket TTS providers can be added by extending the shipped catalog without recompiling the tray app itself
@@ -825,7 +818,7 @@ Status values used below:
- `Supported`
- `Partial`
- `NotSupported (planned)`
-- `Exceeded*`
+- `Exceeded`
| macOS feature | Current Windows state | Notes |
|---|---|---|
@@ -870,15 +863,37 @@ Notes:
- it should remain visible but disabled in Settings until the adapter contract is real and testable
- it should become the shared base path for provider-specific adapters when a generic contract is sufficient
-### Story: Talk Mode overlay and visible phase parity
+### Story: True streaming TTS playback
+
+Start speaking assistant replies from the first usable audio chunk instead of waiting for a complete playable stream.
+
+Notes:
+
+- the current implementation uses WebSocket transport for MiniMax, but still buffers the entire audio response before playback begins
+- `firstChunk=...ms` in the log is currently provider-chunk arrival time, not actual speech-start time
+- implement a playback path that can consume incremental audio data as it arrives from the provider
+- the provider catalog contract should remain transport-driven and provider-agnostic, so streaming behavior should be expressed through the existing TTS contract model rather than hard-coded for MiniMax
+- preserve the existing queued reply behavior, skip support, and late-reply handling while switching playback to progressive output
+- add timing logs that separate `firstChunk`, `playbackStart`, and `playbackEnd` so latency improvements are measurable
+
+
+### Story: Compact Voice Status Strip
-Add a Talk Mode overlay that makes `Listening`, `Thinking`, and `Speaking` visible to the user in the same way the macOS experience does.
+Add an optional, tiny always-on-top voice strip window for all Voice Modes. The strip also provides parity
+with the macOS Talk Mode overlay by making the current phase visible (phase visibility is currently implemented via dynamic icon updates).
Notes:
-- the current tray icon and status window are not equivalent to an always-visible Talk Mode surface
-- the overlay should expose phase transitions clearly and support later stop / dismiss controls
-- this should be designed alongside the existing compact voice-strip idea so the two UI surfaces do not conflict
+- user-configurable show / hide
+- intended to be a minimal one-line-high display with a small amount of padding
+- should show:
+ - current voice state
+ - rolling live transcript while listening
+ - rolling assistant text while speaking
+ - a skip / cut-off control while speaking
+- the runtime control surface should drive this window rather than the window manipulating `VoiceService` internals directly
+- if implemented later, the strip should use the shared runtime control API described elsewhere in this document.
+
### Story: Talk Mode overlay controls
@@ -890,15 +905,6 @@ Notes:
- Windows currently requires tray or settings interaction instead
- this should plug into the shared runtime control API rather than directly manipulating `VoiceService`
-### Story: Same-as-typing WebChat parity for Talk Mode
-
-Decide whether Windows Talk Mode should optionally route sent transcripts through a typed-chat equivalent path so WebChat behavior matches manual typing more closely.
-
-Notes:
-
-- current Windows Talk Mode intentionally uses direct `chat.send`
-- replies still appear in WebChat through session updates, but the send path is not literally the same as typed WebChat submission
-- this should only be revisited if the memory / prompt-shaping issues can be fixed without reintroducing transport fragility
### Story: Voice directives in replies
@@ -960,7 +966,8 @@ Notes:
- keep this visible but greyed out in settings until the embedded runtime is implemented
- the user should be able to choose their own downloaded model bundle and language-appropriate package
-- model lifecycle, validation, and error reporting should be handled in the embedded adapter rather than in the Windows.Media route
+ - open question: how much assistance to offer; probably point users to Whisper model downloads, but the final choice must be theirs
+- model lifecycle, validation, and error reporting should be handled in the embedded adapter
### Story: Full-duplex / barge-in Talk Mode
@@ -977,7 +984,8 @@ Notes:
- barge-in detection and playback interruption rules
- a policy for whether interrupt speech cancels the current reply or queues behind it
- additional runtime control/status so the UI can show when barge-in is armed
-- this should be treated as a separate engineering phase, not a small extension of the current Talk Mode runtime
+- the macOS app uses barge-in to cancel/skip the current playback.
+
### Story: Voice Wake wake-word runtime
@@ -995,9 +1003,11 @@ Implement a Windows push-to-talk capture path alongside wake-word activation.
Notes:
+- implement this as a separate Voice Mode, with a button on the overlay
- this should support press-to-capture, release-to-finalize semantics
- it should pause the wake runtime while push-to-talk capture is active, then resume it cleanly afterward
-- Windows-specific hotkey and permissions behavior should be documented explicitly once chosen
+- Windows-specific hotkey and permissions behavior, if any, should be documented explicitly once chosen
+
### Story: Voice Wake overlay lifecycle
@@ -1009,17 +1019,6 @@ Notes:
- manual dismiss must never block recognizer restart
- overlay and runtime should be coordinated through a session controller rather than direct UI coupling
-### Story: Voice Wake settings parity
-
-Add the user-facing Voice Wake settings surface that exists on macOS.
-
-Notes:
-
-- include language and mic pickers
-- include a live level meter
-- include trigger-word editing or table management
-- include a local-only tester that does not forward
-- preserve the chosen mic if it disconnects, surface a disconnected hint, and fall back to the system default until it returns
### Story: Voice Wake sounds and chimes
@@ -1062,35 +1061,6 @@ Notes:
- expose the tuning through voice settings rather than hard-coded constants alone
-### Story: Compact Voice Status Strip
-
-Add an optional tiny always-on-top voice strip window for Talk Mode.
-
-Notes:
-
-- user-configurable show / hide
-- intended to be a minimal one-line-high display with a small amount of padding
-- should show:
- - current voice state
- - rolling live transcript while listening
- - rolling assistant text while speaking
- - a skip / cut-off control while speaking
-- the runtime control surface should drive this window rather than the window manipulating `VoiceService` internals directly
-- if implemented later, the strip should use the shared runtime control API described elsewhere in this document.
-
-### Story: True streaming TTS playback
-
-Start speaking assistant replies from the first usable audio chunk instead of waiting for a complete playable stream.
-
-Notes:
-
-- the current implementation uses WebSocket transport for MiniMax, but still buffers the entire audio response before playback begins
-- `firstChunk=...ms` in the log is currently provider-chunk arrival time, not actual speech-start time
-- implement a playback path that can consume incremental audio data as it arrives from the provider
-- the provider catalog contract should remain transport-driven and provider-agnostic, so streaming behavior should be expressed through the existing TTS contract model rather than hard-coded for MiniMax
-- preserve the existing queued reply behavior, skip support, and late-reply handling while switching playback to progressive output
-- add timing logs that separate `firstChunk`, `playbackStart`, and `playbackEnd` so latency improvements are measurable
-
## Commit Timeline
Append one new line to this timeline for every future voice-mode commit.
@@ -1157,3 +1127,4 @@ Append one new line to this timeline for every future voice-mode commit.
- `2026-03-26` Cleared the controlled-restart latch as soon as a new recognition session successfully starts, and on failed resume attempts, so Talk Mode does not get stranded in `controlled-restart-in-progress` after an idle/deaf recognizer recycle.
- `2026-03-26` Split Talk Mode STT startup into explicit route classes: the built-in `windows` provider now uses a pure `Windows.Media` path with no AudioGraph, while `foundry-local` and `sherpa-onnx` are separated into dedicated future AudioGraph/embedded route classes instead of sharing the Windows pipeline.
- `2026-03-26` Updated Talk Mode to mirror sent voice turns back into the tray chat window, clear stale low-confidence drafts, locally reprompt when speech activity ends without a usable transcript, surface provider and mode descriptions in Settings, expose `http/ws` and `sherpa-onnx` as coming-soon STT catalog entries, and lock device selection to system defaults when Windows Speech Recognition is selected.
+- `2026-03-26` Re-anchored mirrored outgoing voice turns beside the live tray-chat composer instead of rendering them in a floating overlay, added seam-level DOM bridge tests for that WebView path, renamed the reply-skip command to `voice.response.skip`, and corrected the architecture document to match the current pure Windows.Media runtime.
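Reviewer note on the documentation diff above: the updated Talk Mode flowchart submits a final result only when its confidence is `Medium` or `High`, and then applies a duplicate guard before `chat.send`. A minimal sketch of that gate follows. It assumes a stand-in `Confidence` enum in place of `Windows.Media.SpeechRecognition.SpeechRecognitionConfidence`, and the `TranscriptGate` type, its constructor parameter, and the empty-text check are hypothetical names and choices made for illustration. The real logic lives inside `VoiceService` and may differ in detail.

```csharp
using System;

// Stand-in for Windows.Media.SpeechRecognition.SpeechRecognitionConfidence (assumption).
public enum Confidence { High, Medium, Low, Rejected }

// Hypothetical helper: decides whether a final transcript should be sent upstream.
public sealed class TranscriptGate
{
    private readonly TimeSpan _duplicateWindow;
    private string _lastTranscript = string.Empty;
    private DateTime _lastTranscriptUtc = DateTime.MinValue;

    public TranscriptGate(TimeSpan duplicateWindow) => _duplicateWindow = duplicateWindow;

    public bool ShouldSubmit(string text, Confidence confidence, DateTime nowUtc)
    {
        // Empty finals are never sendable (assumed; the diff does not show this check).
        if (string.IsNullOrWhiteSpace(text))
        {
            return false;
        }

        // Low and Rejected results are ignored, matching the flowchart's "no" branch.
        if (confidence is Confidence.Low or Confidence.Rejected)
        {
            return false;
        }

        // Duplicate guard: same text (case-insensitive) inside the window is suppressed.
        if (string.Equals(text, _lastTranscript, StringComparison.OrdinalIgnoreCase) &&
            nowUtc - _lastTranscriptUtc < _duplicateWindow)
        {
            return false;
        }

        _lastTranscript = text;
        _lastTranscriptUtc = nowUtc;
        return true;
    }
}
```

Passing the clock in as `nowUtc` keeps the sketch deterministic under test; the patch itself reads `DateTime.UtcNow` directly.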
diff --git a/src/OpenClaw.Shared/Capabilities/VoiceCapability.cs b/src/OpenClaw.Shared/Capabilities/VoiceCapability.cs
index fe1ddea..728b8fd 100644
--- a/src/OpenClaw.Shared/Capabilities/VoiceCapability.cs
+++ b/src/OpenClaw.Shared/Capabilities/VoiceCapability.cs
@@ -7,6 +7,8 @@ namespace OpenClaw.Shared.Capabilities;
public class VoiceCapability : NodeCapabilityBase
{
+ private const string LegacySkipCommand = "voice.skip";
+
private static readonly JsonSerializerOptions s_jsonOptions = new()
{
PropertyNameCaseInsensitive = true
@@ -42,7 +44,7 @@ public override async Task<NodeInvokeResponse> ExecuteAsync(NodeInvokeRequest re
VoiceCommands.Stop => await HandleStopAsync(request),
VoiceCommands.Pause => await HandlePauseAsync(request),
VoiceCommands.Resume => await HandleResumeAsync(request),
- VoiceCommands.Skip => await HandleSkipAsync(request),
+ VoiceCommands.Skip or LegacySkipCommand => await HandleSkipAsync(request),
_ => Error($"Unknown command: {request.Command}")
};
}
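Reviewer note on the hunk above: the rename to `voice.response.skip` keeps the old `voice.skip` string working by matching both constants in one switch arm. The sketch below isolates that pattern. `VoiceCommandRouter` and the returned handler labels are hypothetical, since the real method dispatches to async handlers such as `HandleSkipAsync` and returns a `NodeInvokeResponse`.

```csharp
using System;

// Hypothetical, simplified router illustrating the legacy-alias switch arm.
public static class VoiceCommandRouter
{
    // Current command name, as declared in VoiceCommands.Skip after this patch.
    public const string Skip = "voice.response.skip";

    // Pre-rename name, still accepted so older gateways keep working.
    private const string LegacySkipCommand = "voice.skip";

    public static string Route(string command) => command switch
    {
        "voice.pause" => "pause",
        "voice.resume" => "resume",
        // An `or` pattern over two string constants routes both names to one handler.
        Skip or LegacySkipCommand => "skip",
        _ => $"Unknown command: {command}"
    };
}
```

Because the alias is a private constant and is not part of the advertised command list, new callers only discover `voice.response.skip` while existing callers are not broken.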
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index 6ba5202..356843b 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -15,7 +15,7 @@ public static class VoiceCommands
public const string Stop = "voice.stop";
public const string Pause = "voice.pause";
public const string Resume = "voice.resume";
- public const string Skip = "voice.skip";
+ public const string Skip = "voice.response.skip";
private static readonly ReadOnlyCollection<string> s_all = Array.AsReadOnly(
[
diff --git a/src/OpenClaw.Tray.WinUI/Properties/AssemblyInfo.cs b/src/OpenClaw.Tray.WinUI/Properties/AssemblyInfo.cs
new file mode 100644
index 0000000..a3f7e8f
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Properties/AssemblyInfo.cs
@@ -0,0 +1,3 @@
+using System.Runtime.CompilerServices;
+
+[assembly: InternalsVisibleTo("OpenClaw.Tray.Tests")]
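Reviewer note: `InternalsVisibleTo` lets `OpenClaw.Tray.Tests` call `internal static` helpers such as `ShouldWarnForRecognitionCompletion`, which this patch adds to `VoiceService.cs` further down. The sketch below restates that helper together with the three-way log-level choice made at its call site. The `RecognitionStatus` enum is a stand-in for a subset of `Windows.Media.SpeechRecognition.SpeechRecognitionResultStatus`, and `LogLevel`, `CompletionLogPolicy`, and `Classify` are hypothetical names used only for illustration.

```csharp
using System;

// Stand-in for a subset of SpeechRecognitionResultStatus (assumption).
public enum RecognitionStatus { Success, TimeoutExceeded, UserCanceled, NetworkFailure }

// Hypothetical log levels; the real logger type is not shown in the patch.
public enum LogLevel { Debug, Info, Warn }

public static class CompletionLogPolicy
{
    // Mirrors the helper in the patch: a planned recognizer rebuild never warns,
    // and Success, TimeoutExceeded, and UserCanceled are treated as routine.
    public static bool ShouldWarn(RecognitionStatus status, bool rebuildRecognizer)
    {
        if (rebuildRecognizer)
        {
            return false;
        }

        return status != RecognitionStatus.Success &&
               status != RecognitionStatus.TimeoutExceeded &&
               status != RecognitionStatus.UserCanceled;
    }

    // Mirrors the call site: Warn for unexpected endings, Info for rebuilds
    // and user cancellation, Debug for everything else.
    public static LogLevel Classify(RecognitionStatus status, bool rebuildRecognizer)
    {
        if (ShouldWarn(status, rebuildRecognizer))
        {
            return LogLevel.Warn;
        }

        if (rebuildRecognizer || status == RecognitionStatus.UserCanceled)
        {
            return LogLevel.Info;
        }

        return LogLevel.Debug;
    }
}
```

Keeping the decision in a pure static function is what makes it cheap to cover from the test project without constructing a live recognizer.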
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 350f005..ec51121 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -611,7 +611,7 @@ private async Task ConfigurePlaybackOutputDeviceAsync(MediaPlayer player, VoiceS
}
player.AudioDevice = selectedRenderDevice;
- _logger.Info($"Voice playback output device set to {selectedRenderDevice.Name}");
+ _logger.Debug($"Voice playback output device set to {selectedRenderDevice.Name}");
}
catch (Exception ex)
{
@@ -722,7 +722,7 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
_recognitionSessionHadCaptureSignal = false;
}
- _logger.Info("Starting speech recognition session");
+ _logger.Debug("Starting speech recognition session");
await recognizer.ContinuousRecognitionSession.StartAsync();
lock (_gate)
@@ -742,7 +742,7 @@ private async Task StartRecognitionSessionAsync(bool updateListeningStatus = tru
}
}
- _logger.Info("Speech recognition session started");
+ _logger.Debug("Speech recognition session started");
if (updateListeningStatus)
{
_ = MonitorListeningReadyAsync(generation, runtimeToken);
@@ -799,7 +799,7 @@ private async Task MonitorListeningReadyAsync(int generation, CancellationToken
var readinessSource = captureService == null
? "recognizer warm-up completed"
: "capture frames observed and recognizer warm-up completed";
- _logger.Info(
+ _logger.Debug(
$"Speech pipeline ready; {readinessSource} ({InitialRecognitionReadyDelay.TotalMilliseconds:0}ms)");
}
}
@@ -842,7 +842,7 @@ private async Task ResumeRecognitionSessionAsync(
catch (Exception ex)
{
currentError = GetUserFacingErrorMessage(ex);
- _logger.Warn(
+ _logger.Info(
$"Voice recognition resume failed ({reason}, attempt {attempt}/{maxAttempts}): {ex.Message}");
lock (_gate)
@@ -920,7 +920,7 @@ private async void OnSpeechResultGenerated(
result.Confidence == SpeechRecognitionConfidence.Rejected ||
result.Confidence == SpeechRecognitionConfidence.Low)
{
- _logger.Info($"Voice recognition ignored result with confidence {result.Confidence}: {text}");
+ _logger.Debug($"Voice recognition ignored result with confidence {result.Confidence}: {text}");
return;
}
@@ -1045,7 +1045,7 @@ private async Task HandleRecognizedTextAsync(string text)
if (string.Equals(text, _lastTranscript, StringComparison.OrdinalIgnoreCase) &&
DateTime.UtcNow - _lastTranscriptUtc < DuplicateTranscriptWindow)
{
- _logger.Info($"Voice recognition suppressed duplicate transcript: {text}");
+ _logger.Debug($"Voice recognition suppressed duplicate transcript: {text}");
return;
}
@@ -1084,9 +1084,9 @@ private async Task HandleRecognizedTextAsync(string text)
var directSendStopwatch = Stopwatch.StartNew();
await client.SendChatMessageAsync(text, sessionKey);
directSendElapsedMs = directSendStopwatch.ElapsedMilliseconds;
- _logger.Info($"Voice direct send path: elapsed={directSendElapsedMs}ms");
+ _logger.Debug($"Voice direct send path: elapsed={directSendElapsedMs}ms");
- _logger.Info(
+ _logger.Debug(
$"Voice pre-response latency: recognitionStop={recognitionStopElapsedMs}ms transportReady={transportReadyElapsedMs}ms directSend={directSendElapsedMs}ms total={pipelineStopwatch.ElapsedMilliseconds}ms");
lock (_gate)
{
@@ -1101,7 +1101,7 @@ private async Task HandleRecognizedTextAsync(string text)
_status.LastUtteranceUtc = DateTime.UtcNow;
}
- _logger.Info("Voice response wait started");
+ _logger.Debug("Voice response wait started");
RaiseConversationTurn(VoiceConversationDirection.Outgoing, text, sessionKey);
RaiseTranscriptDraft(string.Empty, sessionKey, clear: true);
_ = MonitorReplyTimeoutAsync(text, cancellationToken);
@@ -1153,7 +1153,7 @@ private async Task MonitorReplyTimeoutAsync(string transcript, CancellationToken
if (shouldResume)
{
- _logger.Warn(
+ _logger.Info(
$"Voice reply wait timed out after {ReplyTimeout.TotalSeconds:0}s; accepting late replies for {LateReplyGraceWindow.TotalSeconds:0}s on session {lateReplySessionKey ?? "(none)"}");
await ResumeRecognitionSessionAsync(cancellationToken, "reply timeout");
}
@@ -1215,7 +1215,7 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
if (acceptedViaLateReplyGrace)
{
- _logger.Warn($"Voice accepted late assistant reply after timeout for session {args.SessionKey}");
+ _logger.Info($"Voice accepted late assistant reply after timeout for session {args.SessionKey}");
}
if (string.IsNullOrWhiteSpace(text))
@@ -1257,7 +1257,7 @@ private void QueueAssistantReplyForPlayback(string text, string? sessionKey, out
lock (_gate)
{
_pendingAssistantReplies.Enqueue((text, sessionKey));
- _logger.Info($"Voice reply queued: pending={_pendingAssistantReplies.Count}");
+ _logger.Debug($"Voice reply queued: pending={_pendingAssistantReplies.Count}");
if (!_replyPlaybackLoopActive)
{
@@ -1334,7 +1334,7 @@ private async Task ProcessQueuedAssistantRepliesAsync()
}
catch (OperationCanceledException)
{
- _logger.Info($"Voice reply playback canceled: remainingQueue={CurrentStatus.PendingReplyCount}");
+ _logger.Debug($"Voice reply playback canceled: remainingQueue={CurrentStatus.PendingReplyCount}");
}
catch (Exception ex)
{
@@ -1365,7 +1365,7 @@ private async Task ProcessQueuedAssistantRepliesAsync()
if (shouldPauseBeforeNextReply)
{
- _logger.Info($"Voice reply playback paused before next queued response ({QueuedReplyPlaybackGap.TotalMilliseconds}ms)");
+ _logger.Debug($"Voice reply playback paused before next queued response ({QueuedReplyPlaybackGap.TotalMilliseconds}ms)");
await Task.Delay(QueuedReplyPlaybackGap);
}
}
@@ -1436,7 +1436,7 @@ private async Task SpeakTextAsync(string text, CancellationToken cancellationTok
var stopwatch = Stopwatch.StartNew();
using var stream = await synthesizer.SynthesizeTextToStreamAsync(text);
- _logger.Info($"Windows TTS latency: total={stopwatch.ElapsedMilliseconds}ms");
+ _logger.Debug($"Windows TTS latency: total={stopwatch.ElapsedMilliseconds}ms");
await PlayStreamAsync(player, stream, stream.ContentType, cancellationToken);
}
@@ -1650,6 +1650,20 @@ internal static bool ShouldRepromptAfterIncompleteRecognition(
!usedFallbackTranscript;
}
+ internal static bool ShouldWarnForRecognitionCompletion(
+ SpeechRecognitionResultStatus status,
+ bool rebuildRecognizer)
+ {
+ if (rebuildRecognizer)
+ {
+ return false;
+ }
+
+ return status != SpeechRecognitionResultStatus.Success &&
+ status != SpeechRecognitionResultStatus.TimeoutExceeded &&
+ status != SpeechRecognitionResultStatus.UserCanceled;
+ }
+
private static string CreateReplyPreview(string text)
{
var trimmed = text.Trim();
@@ -1809,15 +1823,27 @@ private async void OnSpeechRecognitionCompleted(
!string.IsNullOrWhiteSpace(fallbackText));
}
- _logger.Warn(
- $"Speech recognition session completed with status {args.Status}; restart={shouldRestart} ({restartDecisionReason}); rebuild={shouldRebuildRecognizer} ({rebuildDecisionReason}); hadActivity={sessionHadActivity}; hadCaptureSignal={sessionHadCaptureSignal}");
+ var completionMessage =
+ $"Speech recognition session completed with status {args.Status}; restart={shouldRestart} ({restartDecisionReason}); rebuild={shouldRebuildRecognizer} ({rebuildDecisionReason}); hadActivity={sessionHadActivity}; hadCaptureSignal={sessionHadCaptureSignal}";
+ if (ShouldWarnForRecognitionCompletion(args.Status, shouldRebuildRecognizer))
+ {
+ _logger.Warn(completionMessage);
+ }
+ else if (shouldRebuildRecognizer || args.Status == SpeechRecognitionResultStatus.UserCanceled)
+ {
+ _logger.Info(completionMessage);
+ }
+ else
+ {
+ _logger.Debug(completionMessage);
+ }
if (!string.IsNullOrWhiteSpace(fallbackText) &&
!_awaitingReply &&
!_isSpeaking &&
!token.IsCancellationRequested)
{
- _logger.Warn(
+ _logger.Info(
$"Voice recognition completed without a final result; promoting recent hypothesis as fallback transcript: {fallbackText}");
await HandleRecognizedTextAsync(fallbackText);
return;
@@ -1830,7 +1856,7 @@ private async void OnSpeechRecognitionCompleted(
if (shouldReprompt)
{
- _logger.Warn("Voice recognition session ended after speech activity but without a usable transcript; prompting user to repeat.");
+ _logger.Info("Voice recognition session ended after speech activity but without a usable transcript; prompting user to repeat.");
QueueAssistantReplyForPlayback(LowConfidenceRepeatPrompt, sessionKey, out _);
return;
}
@@ -2133,7 +2159,7 @@ private async void OnDefaultAudioCaptureDeviceChanged(object sender, DefaultAudi
return;
}
- _logger.Info(
+ _logger.Debug(
$"Default capture device changed to {newDeviceId ?? "(unknown)"}; refreshing TalkMode recognizer");
if (shouldRestartListening)
@@ -2214,7 +2240,7 @@ private async Task RebuildSpeechRecognizerAsync(string reason, CancellationToken
newRecognizer = null;
}
- _logger.Warn($"Speech recognizer rebuilt ({reason})");
+ _logger.Info($"Speech recognizer rebuilt ({reason})");
}
finally
{
@@ -2252,7 +2278,7 @@ private async Task RebuildVoiceCaptureAsync(string reason, CancellationToken can
cancellationToken.ThrowIfCancellationRequested();
await captureService.StartAsync(settings, cancellationToken);
- _logger.Info($"Voice capture graph rebuilt ({reason})");
+ _logger.Debug($"Voice capture graph rebuilt ({reason})");
}
private VoiceStatusInfo BuildStoppedStatus(string? sessionKey, string? reason)
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs
index 27d9e5c..248e0a3 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/WindowsMediaSpeechToTextRoute.cs
@@ -48,7 +48,7 @@ public async Task<SpeechRecognizer> CreateRecognizerAsync(VoiceSettings settings
throw new InvalidOperationException($"Speech recognizer unavailable: {compilation.Status}");
}
- _logger.Info(
+ _logger.Debug(
$"Speech recognizer compiled successfully ({compilation.Status}); endSilenceMs={recognizer.Timeouts.EndSilenceTimeout.TotalMilliseconds:0}; initialSilenceMs={recognizer.Timeouts.InitialSilenceTimeout.TotalMilliseconds:0}; babbleMs={recognizer.Timeouts.BabbleTimeout.TotalMilliseconds:0}");
return recognizer;
}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index ad863c8..6c59b97 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -31,9 +31,9 @@ public sealed partial class WebChatWindow : WindowEx
public bool IsClosed { get; private set; }
- private sealed record VoiceConversationTurnMirror(string Direction, string Text);
+ internal sealed record VoiceConversationTurnMirror(string Direction, string Text);
-private const string TrayVoiceIntegrationScript = """
+internal const string TrayVoiceIntegrationScript = """
(() => {
const isVisible = (el) => !!el && !(el.disabled === true) && el.getClientRects().length > 0;
const memoryPattern = /<relevant-memories>[\s\S]*?<\/relevant-memories>\s*/gi;
@@ -63,17 +63,39 @@ private sealed record VoiceConversationTurnMirror(string Direction, string Text)
}
};
let desiredTurns = [];
- const ensureTurnsHost = () => {
- if (!document.body) return null;
- let host = document.getElementById('openclaw-tray-voice-turns');
- if (host) return host;
- host = document.createElement('div');
- host.id = 'openclaw-tray-voice-turns';
+ const getTurnsAnchor = () => {
+ const composer = findComposer();
+ if (!composer) return null;
+ return composer.closest('form, footer, [role="form"], [data-slot="composer"]') || composer.parentElement || composer;
+ };
+ const applyInlineHostLayout = (host) => {
+ Object.assign(host.style, {
+ position: 'relative',
+ left: 'auto',
+ right: 'auto',
+ bottom: 'auto',
+ width: '100%',
+ maxWidth: '100%',
+ margin: '0 0 12px 0',
+ padding: '0',
+ zIndex: 'auto',
+ display: 'flex',
+ flexDirection: 'column',
+ gap: '8px',
+ pointerEvents: 'none',
+ alignItems: 'stretch'
+ });
+ };
+ const applyFallbackHostLayout = (host) => {
Object.assign(host.style, {
position: 'fixed',
left: '16px',
right: '16px',
bottom: '88px',
+ width: 'auto',
+ maxWidth: 'none',
+ margin: '0',
+ padding: '0',
zIndex: '2147483000',
display: 'flex',
flexDirection: 'column',
@@ -81,7 +103,28 @@ private sealed record VoiceConversationTurnMirror(string Direction, string Text)
pointerEvents: 'none',
alignItems: 'stretch'
});
- document.body.appendChild(host);
+ };
+ const ensureTurnsHost = () => {
+ if (!document.body) return null;
+ let host = document.getElementById('openclaw-tray-voice-turns');
+ if (!host) {
+ host = document.createElement('div');
+ host.id = 'openclaw-tray-voice-turns';
+ host.setAttribute('data-openclaw-tray-voice-turns', 'true');
+ host.setAttribute('aria-live', 'polite');
+ }
+ const anchor = getTurnsAnchor();
+ if (anchor && anchor.parentElement) {
+ applyInlineHostLayout(host);
+ if (host.parentElement !== anchor.parentElement || host.nextSibling !== anchor) {
+ anchor.parentElement.insertBefore(host, anchor);
+ }
+ return host;
+ }
+ applyFallbackHostLayout(host);
+ if (host.parentElement !== document.body) {
+ document.body.appendChild(host);
+ }
return host;
};
const renderTurns = () => {
@@ -196,9 +239,22 @@ private sealed record VoiceConversationTurnMirror(string Direction, string Text)
})();
""";
+ internal static string BuildSetStripInjectedMemoriesScript(bool enabled)
+ => $"window.__openClawTrayVoice?.setStripInjectedMemories?.({(enabled ? "true" : "false")});";
+
+ internal static string BuildDraftScript(string? draft)
+ {
+ return string.IsNullOrWhiteSpace(draft)
+ ? "window.__openClawTrayVoice?.clearDraft?.();"
+ : $"window.__openClawTrayVoice?.setDraft?.({JsonSerializer.Serialize(draft)});";
+ }
+
+ internal static string BuildTurnsScript(IReadOnlyCollection<VoiceConversationTurnMirror> turns)
+ => $"window.__openClawTrayVoice?.setTurns?.({JsonSerializer.Serialize(turns)});";
+
public WebChatWindow(string gatewayUrl, string token, bool stripInjectedMemories)
{
- Logger.Info($"WebChatWindow: Constructor called, gateway={gatewayUrl}");
+ Logger.Debug($"WebChatWindow: Constructor called, gateway={gatewayUrl}");
_gatewayUrl = gatewayUrl;
_token = token;
_stripInjectedMemories = stripInjectedMemories;
@@ -214,7 +270,7 @@ public WebChatWindow(string gatewayUrl, string token, bool stripInjectedMemories
Closed += OnWindowClosed;
- Logger.Info("WebChatWindow: Starting InitializeWebViewAsync");
+ Logger.Debug("WebChatWindow: Starting InitializeWebViewAsync");
_ = InitializeWebViewAsync();
}
@@ -236,7 +292,7 @@ private async Task InitializeWebViewAsync()
{
try
{
- Logger.Info("WebChatWindow: Initializing WebView2...");
+ Logger.Debug("WebChatWindow: Initializing WebView2...");
// Set up user data folder for WebView2
var userDataFolder = Path.Combine(
@@ -244,14 +300,14 @@ private async Task InitializeWebViewAsync()
"OpenClawTray", "WebView2");
Directory.CreateDirectory(userDataFolder);
- Logger.Info($"WebChatWindow: User data folder: {userDataFolder}");
+ Logger.Debug($"WebChatWindow: User data folder: {userDataFolder}");
// Set environment variable for user data folder
Environment.SetEnvironmentVariable("WEBVIEW2_USER_DATA_FOLDER", userDataFolder);
- Logger.Info("WebChatWindow: Calling EnsureCoreWebView2Async...");
+ Logger.Debug("WebChatWindow: Calling EnsureCoreWebView2Async...");
await WebView.EnsureCoreWebView2Async();
- Logger.Info("WebChatWindow: CoreWebView2 initialized successfully");
+ Logger.Debug("WebChatWindow: CoreWebView2 initialized successfully");
// Configure WebView2
WebView.CoreWebView2.Settings.IsStatusBarEnabled = false;
@@ -262,7 +318,7 @@ private async Task InitializeWebViewAsync()
// Handle navigation events (store for cleanup)
_navigationCompletedHandler = (s, e) =>
{
- Logger.Info($"WebChatWindow: Navigation completed, success={e.IsSuccess}, status={e.WebErrorStatus}");
+ Logger.Debug($"WebChatWindow: Navigation completed, success={e.IsSuccess}, status={e.WebErrorStatus}");
LoadingRing.IsActive = false;
LoadingRing.Visibility = Visibility.Collapsed;
_ = RefreshTrayVoiceDomStateAsync();
@@ -292,7 +348,7 @@ private async Task InitializeWebViewAsync()
{
// Strip query params to avoid logging tokens
var safeUri = e.Uri?.Split('?')[0] ?? "unknown";
- Logger.Info($"WebChatWindow: Navigation starting to {safeUri}");
+ Logger.Debug($"WebChatWindow: Navigation starting to {safeUri}");
LoadingRing.IsActive = true;
LoadingRing.Visibility = Visibility.Visible;
};
@@ -391,7 +447,7 @@ private void NavigateToChat()
// If debug URL is set, use it instead of gateway
if (!string.IsNullOrEmpty(DEBUG_TEST_URL))
{
- Logger.Info($"WebChatWindow: DEBUG MODE - Navigating to test URL: {DEBUG_TEST_URL}");
+ Logger.Debug($"WebChatWindow: DEBUG MODE - Navigating to test URL: {DEBUG_TEST_URL}");
WebView.CoreWebView2.Navigate(DEBUG_TEST_URL);
return;
}
@@ -404,7 +460,7 @@ private void NavigateToChat()
}
var safeBaseUrl = url.Split('?')[0];
- Logger.Info($"WebChatWindow: Navigating to {safeBaseUrl} (token hidden)");
+ Logger.Debug($"WebChatWindow: Navigating to {safeBaseUrl} (token hidden)");
WebView.CoreWebView2.Navigate(url);
}
@@ -482,20 +538,14 @@ private async Task RefreshTrayVoiceDomStateAsync()
try
{
- var stripJson = _stripInjectedMemories ? "true" : "false";
await WebView.CoreWebView2.ExecuteScriptAsync(
- $"window.__openClawTrayVoice?.setStripInjectedMemories?.({stripJson});");
-
- var draftJson = JsonSerializer.Serialize(_pendingVoiceDraft ?? string.Empty);
- var script = string.IsNullOrWhiteSpace(_pendingVoiceDraft)
- ? "window.__openClawTrayVoice?.clearDraft?.();"
- : $"window.__openClawTrayVoice?.setDraft?.({draftJson});";
+ BuildSetStripInjectedMemoriesScript(_stripInjectedMemories));
- await WebView.CoreWebView2.ExecuteScriptAsync(script);
+ await WebView.CoreWebView2.ExecuteScriptAsync(
+ BuildDraftScript(_pendingVoiceDraft));
- var turnsJson = JsonSerializer.Serialize(_pendingVoiceTurns);
await WebView.CoreWebView2.ExecuteScriptAsync(
- $"window.__openClawTrayVoice?.setTurns?.({turnsJson});");
+ BuildTurnsScript(_pendingVoiceTurns));
}
catch (Exception ex)
{
diff --git a/tests/OpenClaw.Shared.Tests/CapabilityTests.cs b/tests/OpenClaw.Shared.Tests/CapabilityTests.cs
index 89158ea..1bc23ce 100644
--- a/tests/OpenClaw.Shared.Tests/CapabilityTests.cs
+++ b/tests/OpenClaw.Shared.Tests/CapabilityTests.cs
@@ -1132,4 +1132,33 @@ public async Task Start_ReturnsError_WhenHandlerMissing()
Assert.False(res.Ok);
Assert.Contains("not available", res.Error, StringComparison.OrdinalIgnoreCase);
}
+
+ [Fact]
+ public async Task LegacyVoiceSkipCommand_RemainsAccepted()
+ {
+ var cap = new VoiceCapability(NullLogger.Instance);
+ VoiceSkipArgs? received = null;
+ cap.SkipRequested += args =>
+ {
+ received = args;
+ return Task.FromResult(new VoiceStatusInfo
+ {
+ Available = true,
+ Running = true,
+ Mode = VoiceActivationMode.TalkMode,
+ State = VoiceRuntimeState.PlayingResponse
+ });
+ };
+
+ var res = await cap.ExecuteAsync(new NodeInvokeRequest
+ {
+ Id = "voice8",
+ Command = "voice.skip",
+ Args = Parse("""{"reason":"legacy caller"}""")
+ });
+
+ Assert.True(res.Ok);
+ Assert.NotNull(received);
+ Assert.Equal("legacy caller", received!.Reason);
+ }
}
diff --git a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
index 37093be..2f3323e 100644
--- a/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
+++ b/tests/OpenClaw.Shared.Tests/VoiceModeSchemaTests.cs
@@ -18,7 +18,7 @@ public void All_ContainsExpectedCommandsInStableOrder()
"voice.stop",
"voice.pause",
"voice.resume",
- "voice.skip"
+ "voice.response.skip"
],
VoiceCommands.All);
}
diff --git a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
index a1fa893..3a01919 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceServiceTransportTests.cs
@@ -235,6 +235,26 @@ public void DescribeRecognitionCompletionRebuildDecision_ExplainsWhyRebuildIsBlo
Assert.Equal(expected, result);
}
+ [Theory]
+ [InlineData(SpeechRecognitionResultStatus.Success, false, false)]
+ [InlineData(SpeechRecognitionResultStatus.TimeoutExceeded, false, false)]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, false, false)]
+ [InlineData(SpeechRecognitionResultStatus.UserCanceled, true, false)]
+ [InlineData(SpeechRecognitionResultStatus.GrammarCompilationFailure, false, true)]
+ public void ShouldWarnForRecognitionCompletion_OnlyWarnsForUnexpectedStatuses(
+ SpeechRecognitionResultStatus status,
+ bool rebuildRecognizer,
+ bool expected)
+ {
+ var method = typeof(VoiceService).GetMethod(
+ "ShouldWarnForRecognitionCompletion",
+ BindingFlags.NonPublic | BindingFlags.Static)!;
+
+ var result = (bool)method.Invoke(null, [status, rebuildRecognizer])!;
+
+ Assert.Equal(expected, result);
+ }
+
[Theory]
[InlineData(16000, 80, 1280)]
[InlineData(16000, 0, 1280)]
diff --git a/tests/OpenClaw.Tray.Tests/WebChatWindowDomBridgeTests.cs b/tests/OpenClaw.Tray.Tests/WebChatWindowDomBridgeTests.cs
new file mode 100644
index 0000000..c8797af
--- /dev/null
+++ b/tests/OpenClaw.Tray.Tests/WebChatWindowDomBridgeTests.cs
@@ -0,0 +1,36 @@
+using OpenClawTray.Windows;
+
+namespace OpenClaw.Tray.Tests;
+
+public class WebChatWindowDomBridgeTests
+{
+ [Fact]
+ public void BuildDraftScript_ClearsWhenDraftIsBlank()
+ {
+ var script = WebChatWindow.BuildDraftScript(string.Empty);
+
+ Assert.Equal("window.__openClawTrayVoice?.clearDraft?.();", script);
+ }
+
+ [Fact]
+ public void BuildTurnsScript_SerializesOutgoingTurns()
+ {
+ var turns = new[]
+ {
+ new WebChatWindow.VoiceConversationTurnMirror("outgoing", "hello from voice")
+ };
+
+ var script = WebChatWindow.BuildTurnsScript(turns);
+
+ Assert.Contains("setTurns", script);
+ Assert.Contains("\"direction\":\"outgoing\"", script);
+ Assert.Contains("\"text\":\"hello from voice\"", script);
+ }
+
+ [Fact]
+ public void VoiceIntegrationScript_AnchorsTurnsBesideComposer()
+ {
+ Assert.Contains("getTurnsAnchor", WebChatWindow.TrayVoiceIntegrationScript);
+ Assert.Contains("insertBefore(host, anchor)", WebChatWindow.TrayVoiceIntegrationScript);
+ }
+}
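
Reviewer note on the recognition-completion logging above: the if / else-if / else chain in OnSpeechRecognitionCompleted can be read as a three-way classification. The following is a minimal sketch of that reading, not code from the patch. The enum is a trimmed stand-in for Windows.Media.SpeechRecognition.SpeechRecognitionResultStatus so the sketch compiles without the Windows SDK, and CompletionLogLevel and Classify are hypothetical names introduced only for illustration.

```csharp
using System;

// Stand-in for Windows.Media.SpeechRecognition.SpeechRecognitionResultStatus,
// trimmed to the values exercised by the tests in this patch (assumption:
// the real enum has more members and different numeric values).
public enum SpeechRecognitionResultStatus
{
    Success,
    TimeoutExceeded,
    UserCanceled,
    GrammarCompilationFailure
}

// Hypothetical enum naming the three logger calls used in the patch.
public enum CompletionLogLevel { Debug, Info, Warn }

public static class RecognitionCompletionLogging
{
    // Mirrors the predicate added to VoiceService in this patch.
    public static bool ShouldWarn(SpeechRecognitionResultStatus status, bool rebuildRecognizer)
    {
        if (rebuildRecognizer)
        {
            return false;
        }

        return status != SpeechRecognitionResultStatus.Success &&
               status != SpeechRecognitionResultStatus.TimeoutExceeded &&
               status != SpeechRecognitionResultStatus.UserCanceled;
    }

    // Hypothetical helper: folds the if / else-if / else chain into one value.
    public static CompletionLogLevel Classify(SpeechRecognitionResultStatus status, bool rebuildRecognizer)
    {
        if (ShouldWarn(status, rebuildRecognizer))
        {
            return CompletionLogLevel.Warn;
        }

        if (rebuildRecognizer || status == SpeechRecognitionResultStatus.UserCanceled)
        {
            return CompletionLogLevel.Info;
        }

        return CompletionLogLevel.Debug;
    }

    public static void Main()
    {
        Console.WriteLine(Classify(SpeechRecognitionResultStatus.GrammarCompilationFailure, false)); // Warn
        Console.WriteLine(Classify(SpeechRecognitionResultStatus.GrammarCompilationFailure, true));  // Info
        Console.WriteLine(Classify(SpeechRecognitionResultStatus.UserCanceled, false));              // Info
        Console.WriteLine(Classify(SpeechRecognitionResultStatus.Success, false));                   // Debug
    }
}
```

Under this reading, a pending recognizer rebuild always downgrades the completion message to Info, since the rebuild itself is logged separately, and only unexpected statuses without a rebuild reach Warn.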
From 89ccb0800eea4f5437b9d4b0ea276202af9e145e Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Sat, 28 Mar 2026 09:29:07 +0000
Subject: [PATCH 78/83] Add compact voice repeater window
---
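
Note for reviewers (ignored by git am): this patch adds VoiceRepeaterWindowSettings with property initializers and nullable window geometry. The sketch below copies that class as it appears in the SettingsData.cs hunk and shows how such a shape behaves when an older settings payload lacks the new fields. It assumes plain System.Text.Json with default options; how SettingsManager actually loads and saves the file is not shown here, and the Demo class and the JSON strings are hypothetical.

```csharp
using System;
using System.Text.Json;

// Copied from the SettingsData.cs hunk in this patch.
public sealed class VoiceRepeaterWindowSettings
{
    public bool AutoScroll { get; set; } = true;
    public double TextSize { get; set; } = 13;
    public int? Width { get; set; }
    public int? Height { get; set; }
    public int? X { get; set; }
    public int? Y { get; set; }
}

// Hypothetical demo harness, not part of the patch.
public static class Demo
{
    public static VoiceRepeaterWindowSettings Load(string json)
        => JsonSerializer.Deserialize<VoiceRepeaterWindowSettings>(json)
           ?? new VoiceRepeaterWindowSettings();

    public static void Main()
    {
        // An empty object stands in for settings written before this patch:
        // the property initializers supply the defaults.
        var legacy = Load("{}");
        Console.WriteLine($"AutoScroll={legacy.AutoScroll}, TextSize={legacy.TextSize}, HasWidth={legacy.Width.HasValue}");

        // A saved window placement round-trips through the nullable ints.
        var saved = Load("{\"Width\":420,\"Height\":300,\"X\":100,\"Y\":80,\"TextSize\":15}");
        var roundTripped = Load(JsonSerializer.Serialize(saved));
        Console.WriteLine($"Width={roundTripped.Width}, TextSize={roundTripped.TextSize}, AutoScroll={roundTripped.AutoScroll}");
    }
}
```

Leaving Width, Height, X, and Y nullable lets the window distinguish "never placed" from a stored position, so a first launch can fall back to whatever default placement the window chooses.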
.gitignore | 6 +
docs/VOICE-MODE.md | 46 +-
src/OpenClaw.Shared/SettingsData.cs | 11 +
src/OpenClaw.Tray.WinUI/App.xaml.cs | 82 ++-
.../Services/SettingsManager.cs | 10 +-
.../Services/Voice/VoiceChatCoordinator.cs | 37 +-
.../Windows/VoiceRepeaterWindow.xaml | 152 ++++++
.../Windows/VoiceRepeaterWindow.xaml.cs | 467 ++++++++++++++++++
.../VoiceChatCoordinatorTests.cs | 26 +
9 files changed, 802 insertions(+), 35 deletions(-)
create mode 100644 src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml
create mode 100644 src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml.cs
diff --git a/.gitignore b/.gitignore
index 9ed19c0..0c4e131 100644
--- a/.gitignore
+++ b/.gitignore
@@ -345,3 +345,9 @@ MigrationBackup/
# Fody - auto-generated XML schema
FodyWeavers.xsd
Output/
+
+# Repo-local tool caches and workspace metadata
+.claude/
+.dotnet-cli/
+.playwright-cli/
+output/playwright/
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index a97d024..77c6bb2 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -54,19 +54,20 @@ The contracts and persisted settings now use `VoiceWake` and `TalkMode` as well.
- local speech recognition turns that audio into transcript text on the active STT route
- interim hypotheses are surfaced live, but only final `Medium` or `High` confidence recognizer results are submitted
- if speech activity ends without any usable final transcript surviving, Talk Mode now clears the draft and gives a short local repeat prompt instead of silently doing nothing
-- the tray chat window, when open, mirrors the live transcript draft locally
+- the compact voice repeater window, when open, shows the live transcript draft plus local sent/received turns in a single scrolling surface
+- the tray chat window, when open, mirrors the live transcript draft into the compose box only
- the finalized transcript is always sent to OpenClaw via direct `chat.send` on the main session
- OpenClaw returns the assistant reply as normal chat output
- the node performs local or remote TTS playback of that reply
- assistant replies are queued locally and spoken sequentially, with a short (500 ms currently) pause between queued replies so overlapping responses are not lost
- if a reply arrives after the normal 45-second wait timeout, the tray still accepts and speaks that late reply for a short bounded grace window so slow upstream responses are not silently lost
-- the tray chat window can optionally strip injected `<relevant-memories>...</relevant-memories>` blocks from the rendered display without changing the underlying upstream message
+- the tray chat window can optionally strip injected `<relevant-memories>...</relevant-memories>` blocks from mirrored draft text before that draft is injected into the compose box
To avoid obvious duplicate sends from the Windows recognizer, exact duplicate final transcripts are suppressed within a short 750 ms window.
That means the first Windows target is transcript transport, not raw audio upload. Streaming audio frames in or out of OpenClaw remains a future protocol extension, and therefore not part of this design.
-The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection exists to carry assistant chat events for `TalkMode`, and to provide a fallback direct `chat.send` path when the tray chat window is not open.
+The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection carries assistant chat events for `TalkMode`, while the recognized transcript is always sent through the tray app's direct `chat.send` path.
## Voice APIs
@@ -127,7 +128,7 @@ The tray app also exposes in-process interfaces so its own windows do not need t
- `IVoiceRuntime`
- transcript draft and conversation events for chat integration
-This is the intended base for future surfaces such as the compact voice strip.
+This now powers multiple tray-local voice surfaces, including the compact voice repeater window.
### Can the Settings Form Use This API?
@@ -195,7 +196,20 @@ That produced two UX problems:
The tray app keeps a tray-local interim transcript buffer for the current utterance, independent of whether the chat window is open.
-The embedded [WebChatWindow.xaml.cs](../src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs) owns the tray-local chat integration layer:
+The tray now has two separate local voice surfaces:
+
+- [WebChatWindow.xaml.cs](../src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs)
+ - mirrors the live transcript draft into the real WebChat compose box when WebChat is open
+ - optionally strips `<relevant-memories>...</relevant-memories>` from that mirrored draft text before injection
+ - does not own transport
+ - does not try to fake typed-chat parity for sent voice turns
+- [VoiceRepeaterWindow.xaml.cs](../src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml.cs)
+ - is the compact tray-local voice surface
+ - shows live transcript, outgoing sent text, and incoming replies in one scrollable transcript strip
+ - exposes compact pause/resume, skip, mic re-arm/reset, and local repeater settings controls
+ - can open the older Voice Status window for deeper troubleshooting/status detail
+
+The embedded WebChat surface therefore only owns draft mirroring:
- interim STT hypotheses from Windows speech recognition are injected into the tray chat compose box while the user is speaking
- if the chat window opens during an utterance, the current buffered transcript is copied into the compose box immediately
@@ -205,8 +219,6 @@ The embedded [WebChatWindow.xaml.cs](../src/OpenClaw.Tray.WinUI/Windows/WebChatW
This is intentionally a tray-local integration decision, not a protocol-level rewrite of the stored upstream transcript.
-It also fits with the planned voice mode *repeater form*, which will act as an optional small display and control surface whilst voice mode is in operation.
-
### Tradeoffs
- preserves a single visible conversation for the user
@@ -214,6 +226,7 @@ It also fits with the planned voice mode *repeater form*, which will act as an o
- uses only one send path for voice turns, which is simpler to reason about and debug
- requires us to change the existing project, but not too significantly
- keeps a light DOM integration inside the embedded WebView chat surface for draft mirroring only
+- gives voice mode a separate compact surface that does not depend on WebChat DOM behavior for basic transcript/reply visibility
- only affects the tray app chat window; other clients still render upstream content according to their own rules
## Provider Selection
@@ -592,11 +605,13 @@ The current Windows implementation is still centred on `VoiceService`, with a fe
- `VoiceService`
owns Talk Mode runtime state, recognizer/TTS integration, reply queuing, timeouts, gateway reply handling, and the transition layer between `AudioGraph` capture and the current recognizer-owned STT path
- `VoiceChatCoordinator`
- mirrors interim transcript drafts and conversation turns into the tray UI without making the chat window part of the transport path
+ mirrors interim transcript drafts and conversation turns into attached tray windows without making any window part of the transport path
- `OpenClawGatewayClient`
carries direct `chat.send`, final chat events, and the `sessions.preview` fallback path for bare final markers
- `WebChatWindow`
- mirrors live transcript drafts locally and optionally strips injected `<relevant-memories>` blocks from rendered chat text
+ mirrors live transcript drafts into the WebChat compose box and can strip injected `<relevant-memories>` blocks from mirrored draft text before injection
+- `VoiceRepeaterWindow`
+ is the compact local transcript/reply/control surface for Talk Mode
### Current End-to-End Talk Mode
@@ -611,18 +626,23 @@ flowchart LR
C --> F["HypothesisGenerated<br/>interim text"]
F --> G["VoiceService<br/>draft event"]
G --> H["VoiceChatCoordinator"]
- H --> I["WebChatWindow<br/>local compose-box mirror only"]
+ H --> I["WebChatWindow<br/>compose-box mirror only"]
+ H --> I2["VoiceRepeaterWindow<br/>compact local draft surface"]
C --> J["ResultGenerated<br/>final Medium/High text"]
J --> K["VoiceService<br/>duplicate guard + late hypothesis promotion"]
K --> L["Stop recognition session"]
L --> M["OpenClawGatewayClient.SendChatMessageAsync<br/>direct chat.send(main, transcript)"]
M --> N["OpenClaw / session pipeline"]
+ K --> H2["VoiceChatCoordinator<br/>outgoing turn event"]
+ H2 --> I2
N --> O["Chat final event"]
O --> P{"assistant text present?"}
P -- "yes" --> Q["assistant text"]
P -- "no" --> R["sessions.preview fallback<br/>with stale-preview retry guard"]
R --> Q
+ Q --> H3["VoiceChatCoordinator<br/>incoming turn event"]
+ H3 --> I2
Q --> S["VoiceService reply queue"]
S --> T{"TTS provider"}
@@ -642,7 +662,7 @@ flowchart LR
| 1 | `VoiceCaptureService` | selected/default microphone device | continuous frame and signal events from `AudioGraph` |
| 2 | `SpeechRecognizer` | Windows default speech-input path | interim/final transcript text |
| 3 | `VoiceService` | capture signal + final transcript text | health/restart decisions, de-duplicated transcript, runtime state changes |
-| 4 | `VoiceChatCoordinator` | interim/final draft events | mirrored tray chat compose text |
+| 4 | `VoiceChatCoordinator` | draft and conversation-turn events | mirrored draft for WebChat plus compact local transcript/reply updates |
| 5 | `OpenClawGatewayClient` | transcript text + session key | `chat.send` request + assistant reply events |
| 6 | `OpenClawGatewayClient` preview fallback | bare final chat marker | assistant preview text, guarded against stale replay |
| 7 | `VoiceService` reply queue | assistant reply text | ordered reply playback work |
@@ -831,8 +851,8 @@ Status values used below:
|---|---|---|
| Talk Mode continuous loop (`listen -> chat.send(main) -> wait -> speak`) | `Supported` | Windows Talk Mode uses direct `chat.send` on the active main session and loops back to listening after reply playback. |
| Talk Mode sends after a short silence window | `Supported` | The current runtime finalizes on recognition pause and uses configurable Talk Mode silence settings. |
-| Talk Mode visible phase transitions (`Listening -> Thinking -> Speaking`) | `Partial` | Runtime states and tray icon changes exist, but there is no always-visible overlay yet. |
-| Talk Mode always-on overlay with click-to-stop / click-X controls | `NotSupported (planned)` | Windows currently has a tray icon, status window, and draft mirroring, but no overlay surface. |
+| Talk Mode visible phase transitions (`Listening -> Thinking -> Speaking`) | `Partial` | Runtime states, tray icon changes, and the compact voice repeater window exist, but there is no always-visible overlay yet. |
+| Talk Mode always-on overlay with click-to-stop / click-X controls | `NotSupported (planned)` | Windows currently has a tray icon, a manually-opened compact repeater window, and WebChat draft mirroring, but no always-on overlay surface. |
| Talk Mode writes replies into WebChat the same way typed chat does | `Partial` | Replies appear in WebChat through normal session updates, but Talk Mode uses direct send rather than a same-as-typing transport path. |
| Talk Mode interrupt-on-speech / barge-in | `NotSupported (planned)` | Windows is still half-duplex during reply playback. |
| Talk Mode voice directives in replies | `NotSupported (planned)` | Windows does not yet parse or apply the JSON voice directive line described in the Talk Mode docs. |
diff --git a/src/OpenClaw.Shared/SettingsData.cs b/src/OpenClaw.Shared/SettingsData.cs
index 6bcfcd9..4dbd06b 100644
--- a/src/OpenClaw.Shared/SettingsData.cs
+++ b/src/OpenClaw.Shared/SettingsData.cs
@@ -29,6 +29,7 @@ public class SettingsData
public bool PreferStructuredCategories { get; set; } = true;
public List<UserNotificationRule>? UserRules { get; set; }
public VoiceSettings Voice { get; set; } = new();
+ public VoiceRepeaterWindowSettings VoiceRepeaterWindow { get; set; } = new();
public VoiceProviderConfigurationStore VoiceProviderConfiguration { get; set; } = new();
[JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
public VoiceProviderCredentials? VoiceProviderCredentials { get; set; }
@@ -65,3 +66,13 @@ private static string MigrateLegacyVoiceJson(string json)
.Replace("\"State\": \"ListeningForWakeWord\"", "\"State\": \"ListeningForVoiceWake\"", StringComparison.Ordinal);
}
}
+
+public sealed class VoiceRepeaterWindowSettings
+{
+ public bool AutoScroll { get; set; } = true;
+ public double TextSize { get; set; } = 13;
+ public int? Width { get; set; }
+ public int? Height { get; set; }
+ public int? X { get; set; }
+ public int? Y { get; set; }
+}
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index 818847e..b5cd0cd 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -66,6 +66,7 @@ public partial class App : Application
// Windows (created on demand)
private SettingsWindow? _settingsWindow;
+ private VoiceRepeaterWindow? _voiceRepeaterWindow;
private VoiceModeWindow? _voiceModeWindow;
private WebChatWindow? _webChatWindow;
private StatusDetailWindow? _statusDetailWindow;
@@ -1697,10 +1698,12 @@ private VoiceTrayIconState GetVoiceTrayIconState()
return voiceStatus.State switch
{
VoiceRuntimeState.PlayingResponse => VoiceTrayIconState.Speaking,
+ VoiceRuntimeState.ListeningForVoiceWake => VoiceTrayIconState.Listening,
+ VoiceRuntimeState.ListeningContinuously => VoiceTrayIconState.Listening,
VoiceRuntimeState.RecordingUtterance => VoiceTrayIconState.Listening,
VoiceRuntimeState.Paused => VoiceTrayIconState.Off,
_ when voiceStatus.Mode == VoiceActivationMode.Off => VoiceTrayIconState.Off,
- _ => VoiceTrayIconState.Armed
+ _ => VoiceTrayIconState.Off
};
}
@@ -1731,17 +1734,59 @@ private void ShowVoiceModeSettings()
if (_settings == null || _voiceService == null)
return;
+ if (_voiceRepeaterWindow == null || _voiceRepeaterWindow.IsClosed)
+ {
+ _voiceRepeaterWindow = new VoiceRepeaterWindow(_settings, _voiceService);
+ _voiceRepeaterWindow.OpenVoiceStatusRequested += OnOpenVoiceStatusRequested;
+ _voiceRepeaterWindow.Closed += (s, e) =>
+ {
+ _voiceChatCoordinator?.DetachWindow(_voiceRepeaterWindow);
+ _voiceRepeaterWindow.OpenVoiceStatusRequested -= OnOpenVoiceStatusRequested;
+ _voiceRepeaterWindow = null;
+ };
+ _voiceChatCoordinator?.AttachWindow(_voiceRepeaterWindow);
+ }
+
+ _voiceRepeaterWindow.RefreshStatus();
+ _voiceRepeaterWindow.Activate();
+ }
+
+ private void ShowVoiceStatusWindow()
+ {
+ if (_settings == null || _voiceService == null)
+ {
+ return;
+ }
+
if (_voiceModeWindow == null || _voiceModeWindow.IsClosed)
{
_voiceModeWindow = new VoiceModeWindow(_settings, _voiceService, _voiceService);
- _voiceModeWindow.OpenSettingsRequested += (s, e) => ShowSettings();
- _voiceModeWindow.Closed += (s, e) => _voiceModeWindow = null;
+ _voiceModeWindow.OpenSettingsRequested += OnVoiceModeOpenSettingsRequested;
+ _voiceModeWindow.Closed += (s, e) =>
+ {
+ if (_voiceModeWindow != null)
+ {
+ _voiceModeWindow.OpenSettingsRequested -= OnVoiceModeOpenSettingsRequested;
+ }
+
+ _voiceModeWindow = null;
+ };
}
_voiceModeWindow.RefreshStatus();
_voiceModeWindow.Activate();
}
+ private void OnOpenVoiceStatusRequested(object? sender, EventArgs e)
+ {
+ ShowVoiceStatusWindow();
+ }
+
+ private void OnVoiceModeOpenSettingsRequested(object? sender, EventArgs e)
+ {
+ ShowSettings();
+ }
+
private async void OnSettingsSaved(object? sender, EventArgs e)
{
// Reconnect with new settings - mirror the startup if/else pattern
@@ -1819,7 +1864,7 @@ private async void OnSettingsSaved(object? sender, EventArgs e)
_globalHotkey?.Unregister();
}
- if (_webChatWindow != null && ! _webChatWindow.IsClosed)
+ if (_webChatWindow != null && !_webChatWindow.IsClosed)
{
try
{
@@ -1831,6 +1876,9 @@ private async void OnSettingsSaved(object? sender, EventArgs e)
}
}
+ _voiceRepeaterWindow?.RefreshStatus();
+ _voiceModeWindow?.RefreshStatus();
+
// Update auto-start
AutoStartManager.SetAutoStart(_settings.AutoStart);
}
@@ -2105,6 +2153,7 @@ private async Task ToggleVoiceQuickPauseAsync()
try
{
var status = await _voiceService.ToggleQuickPauseAsync();
+ _voiceRepeaterWindow?.RefreshStatus();
_voiceModeWindow?.RefreshStatus();
ShowVoiceQuickToggleToast(status);
}
@@ -2384,6 +2433,15 @@ private void OnToastActivated(ToastNotificationActivatedEventArgsCompat args)
private void ExitApplication()
{
Logger.Info("Application exiting");
+
+ TryCloseWindow(_voiceRepeaterWindow);
+ TryCloseWindow(_voiceModeWindow);
+ TryCloseWindow(_webChatWindow);
+ TryCloseWindow(_settingsWindow);
+ TryCloseWindow(_statusDetailWindow);
+ TryCloseWindow(_notificationHistoryWindow);
+ TryCloseWindow(_activityStreamWindow);
+ TryCloseWindow(_quickSendDialog);
// Cancel background tasks
_deepLinkCts?.Cancel();
@@ -2418,6 +2476,22 @@ private void ExitApplication()
Exit();
}
+ private static void TryCloseWindow(Window? window)
+ {
+ if (window == null)
+ {
+ return;
+ }
+
+ try
+ {
+ window.Close();
+ }
+ catch
+ {
+ }
+ }
+
#endregion
private Microsoft.UI.Dispatching.DispatcherQueue? AppDispatcherQueue =>
diff --git a/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs b/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
index 0320136..261ed9b 100644
--- a/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs
@@ -43,6 +43,7 @@ public class SettingsManager
public bool PreferStructuredCategories { get; set; } = true;
public List<OpenClaw.Shared.UserNotificationRule> UserRules { get; set; } = new();
public VoiceSettings Voice { get; set; } = new();
+ public VoiceRepeaterWindowSettings VoiceRepeaterWindow { get; set; } = new();
public VoiceProviderConfigurationStore VoiceProviderConfiguration { get; set; } = new();
// Node mode (enables Windows as a node, not just operator)
@@ -85,6 +86,7 @@ public void Load()
if (loaded.UserRules != null)
UserRules = loaded.UserRules;
Voice = loaded.Voice ?? new VoiceSettings();
+ VoiceRepeaterWindow = loaded.VoiceRepeaterWindow ?? new VoiceRepeaterWindowSettings();
VoiceProviderConfiguration = loaded.VoiceProviderConfiguration?.Clone() ?? new VoiceProviderConfigurationStore();
VoiceProviderConfiguration.MigrateLegacyCredentials(loaded.VoiceProviderCredentials);
}
@@ -96,7 +98,7 @@ public void Load()
}
}
- public void Save()
+ public void Save(bool logSuccess = true)
{
try
{
@@ -124,13 +126,17 @@ public void Save()
PreferStructuredCategories = PreferStructuredCategories,
UserRules = UserRules,
Voice = Voice,
+ VoiceRepeaterWindow = VoiceRepeaterWindow,
VoiceProviderConfiguration = VoiceProviderConfiguration.Clone()
};
var json = data.ToJson();
File.WriteAllText(SettingsFilePath, json);
- Logger.Info("Settings saved");
+ if (logSuccess)
+ {
+ Logger.Info("Settings saved");
+ }
}
catch (Exception ex)
{
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
index 9a9b52b..a3735d4 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceChatCoordinator.cs
@@ -12,7 +12,7 @@ public sealed class VoiceChatCoordinator : IDisposable
private readonly IUiDispatcher _dispatcher;
private readonly object _gate = new();
- private IVoiceChatWindow? _webChatWindow;
+ private readonly List<IVoiceChatWindow> _windows = [];
private string _voiceTranscriptDraftText = string.Empty;
private readonly List<VoiceConversationTurnEventArgs> _bufferedConversationTurns = [];
private bool _disposed;
@@ -36,12 +36,12 @@ public void AttachWindow(IVoiceChatWindow window)
lock (_gate)
{
- if (ReferenceEquals(_webChatWindow, window))
+ if (_windows.Contains(window))
{
return;
}
- _webChatWindow = window;
+ _windows.Add(window);
}
_ = window.UpdateVoiceTranscriptDraftAsync(
@@ -64,17 +64,18 @@ public void DetachWindow(IVoiceChatWindow? window)
{
lock (_gate)
{
- if (_webChatWindow == null)
+ if (_windows.Count == 0)
{
return;
}
- if (window != null && !ReferenceEquals(_webChatWindow, window))
+ if (window == null)
{
+ _windows.Clear();
return;
}
- _webChatWindow = null;
+ _windows.Remove(window);
}
}
@@ -95,7 +96,7 @@ private void OnVoiceConversationTurnAvailable(object? sender, VoiceConversationT
{
_dispatcher.TryEnqueue(() =>
{
- IVoiceChatWindow? window;
+ List<IVoiceChatWindow> windows;
lock (_gate)
{
_bufferedConversationTurns.Add(CloneTurn(args));
@@ -104,12 +105,15 @@ private void OnVoiceConversationTurnAvailable(object? sender, VoiceConversationT
_bufferedConversationTurns.RemoveAt(0);
}
- window = _webChatWindow;
+ windows = [.. _windows];
}
- if (window != null && !window.IsClosed)
+ foreach (var window in windows)
{
- _ = window.AppendVoiceConversationTurnAsync(args);
+ if (!window.IsClosed)
+ {
+ _ = window.AppendVoiceConversationTurnAsync(args);
+ }
}
ConversationTurnAvailable?.Invoke(this, args);
@@ -122,18 +126,19 @@ private void OnVoiceTranscriptDraftUpdated(object? sender, VoiceTranscriptDraftE
{
_voiceTranscriptDraftText = args.Clear ? string.Empty : (args.Text ?? string.Empty);
- IVoiceChatWindow? window;
+ List<IVoiceChatWindow> windows;
lock (_gate)
{
- window = _webChatWindow;
+ windows = [.. _windows];
}
- if (window == null || window.IsClosed)
+ foreach (var window in windows)
{
- return;
+ if (!window.IsClosed)
+ {
+ _ = window.UpdateVoiceTranscriptDraftAsync(_voiceTranscriptDraftText, args.Clear);
+ }
}
-
- _ = window.UpdateVoiceTranscriptDraftAsync(_voiceTranscriptDraftText, args.Clear);
});
}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml b/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml
new file mode 100644
index 0000000..7714ead
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml
@@ -0,0 +1,152 @@
+<?xml version="1.0" encoding="utf-8"?>
+<winex:WindowEx
+ x:Class="OpenClawTray.Windows.VoiceRepeaterWindow"
+ xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
+ xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
+ xmlns:winex="using:WinUIEx"
+ Title="Voice Mode"
+ MinWidth="320"
+ MinHeight="150">
+
+ <Window.SystemBackdrop>
+ <MicaBackdrop/>
+ </Window.SystemBackdrop>
+
+ <Grid x:Name="WindowRoot">
+ <Grid.RowDefinitions>
+ <RowDefinition Height="*"/>
+ <RowDefinition Height="Auto"/>
+ </Grid.RowDefinitions>
+
+ <Border Grid.Row="0"
+ Margin="4,4,4,0"
+ Background="{ThemeResource CardBackgroundFillColorDefaultBrush}"
+ BorderBrush="{ThemeResource CardStrokeColorDefaultBrush}"
+ BorderThickness="1"
+ CornerRadius="10"
+ Padding="6">
+ <Grid>
+ <TextBlock x:Name="EmptyConversationTextBlock"
+ Text="Transcript and replies appear here."
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ HorizontalAlignment="Center"
+ VerticalAlignment="Center"
+ TextWrapping="Wrap"/>
+
+ <ScrollViewer x:Name="ConversationScrollViewer"
+ VerticalScrollBarVisibility="Auto"
+ HorizontalScrollBarVisibility="Disabled">
+ <StackPanel Spacing="6">
+ <ItemsControl x:Name="ConversationItemsControl">
+ <ItemsControl.ItemTemplate>
+ <DataTemplate>
+ <StackPanel Spacing="1">
+ <TextBlock Text="{Binding Caption}"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ FontSize="{Binding CaptionFontSize}"/>
+ <TextBlock Text="{Binding Message}"
+ FontSize="{Binding MessageFontSize}"
+ TextWrapping="Wrap"
+ MaxLines="5"/>
+ </StackPanel>
+ </DataTemplate>
+ </ItemsControl.ItemTemplate>
+ </ItemsControl>
+
+ <StackPanel x:Name="DraftPanel"
+ Visibility="Collapsed"
+ Spacing="1">
+ <TextBlock x:Name="DraftCaptionTextBlock"
+ Text="You (draft)"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ FontSize="10"/>
+ <TextBlock x:Name="DraftTextBlock"
+ FontSize="13"
+ TextWrapping="Wrap"
+ FontStyle="Italic"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ MaxLines="4"/>
+ </StackPanel>
+ </StackPanel>
+ </ScrollViewer>
+ </Grid>
+ </Border>
+
+ <Border Grid.Row="1"
+ Margin="4"
+ Background="{ThemeResource CardBackgroundFillColorDefaultBrush}"
+ BorderBrush="{ThemeResource CardStrokeColorDefaultBrush}"
+ BorderThickness="1"
+ CornerRadius="10"
+ Padding="6,4">
+ <Grid ColumnSpacing="4">
+ <Grid.ColumnDefinitions>
+ <ColumnDefinition Width="*"/>
+ <ColumnDefinition Width="Auto"/>
+ </Grid.ColumnDefinitions>
+
+ <TextBlock x:Name="TroubleshootingTextBlock"
+ Grid.Column="0"
+ VerticalAlignment="Center"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"
+ Visibility="Collapsed"
+ MaxLines="2"
+ TextWrapping="Wrap"
+ FontSize="10"/>
+
+ <StackPanel Grid.Column="1"
+ Orientation="Horizontal"
+ Spacing="4">
+ <Button x:Name="PauseResumeButton"
+ Width="30"
+ Height="28"
+ Click="OnPauseResumeClick"
+ ToolTipService.ToolTip="Pause or resume voice mode">
+ <FontIcon x:Name="PauseResumeIcon" Glyph="&#xE769;" FontSize="11"/>
+ </Button>
+
+ <Button x:Name="SkipReplyButton"
+ Width="30"
+ Height="28"
+ Click="OnSkipReplyClick"
+ ToolTipService.ToolTip="Skip current reply">
+ <FontIcon Glyph="&#xE72A;" FontSize="11"/>
+ </Button>
+
+ <Button x:Name="ViewSettingsButton"
+ Width="30"
+ Height="28"
+ ToolTipService.ToolTip="Voice repeater settings">
+ <FontIcon Glyph="&#xE713;" FontSize="11"/>
+ <Button.Flyout>
+ <Flyout Placement="TopEdgeAlignedRight">
+ <StackPanel Width="220" Spacing="8" Padding="8">
+ <CheckBox x:Name="AutoScrollCheckBox"
+ Content="Auto-scroll"
+ Checked="OnAutoScrollChanged"
+ Unchecked="OnAutoScrollChanged"/>
+
+ <StackPanel Spacing="4">
+ <TextBlock Text="Text size"
+ Foreground="{ThemeResource TextFillColorSecondaryBrush}"/>
+ <ComboBox x:Name="TextSizeComboBox"
+ SelectionChanged="OnTextSizeSelectionChanged">
+ <ComboBoxItem Content="11 pt" Tag="11"/>
+ <ComboBoxItem Content="12 pt" Tag="12"/>
+ <ComboBoxItem Content="13 pt" Tag="13"/>
+ <ComboBoxItem Content="14 pt" Tag="14"/>
+ <ComboBoxItem Content="15 pt" Tag="15"/>
+ </ComboBox>
+ </StackPanel>
+
+ <Button Content="Open Voice Status"
+ Click="OnOpenVoiceStatusClick"/>
+ </StackPanel>
+ </Flyout>
+ </Button.Flyout>
+ </Button>
+ </StackPanel>
+ </Grid>
+ </Border>
+ </Grid>
+</winex:WindowEx>
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml.cs
new file mode 100644
index 0000000..6ff8a31
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml.cs
@@ -0,0 +1,467 @@
+using Microsoft.UI.Windowing;
+using Microsoft.UI.Dispatching;
+using Microsoft.UI.Xaml;
+using Microsoft.UI.Xaml.Controls;
+using OpenClaw.Shared;
+using OpenClawTray.Helpers;
+using OpenClawTray.Services;
+using OpenClawTray.Services.Voice;
+using System;
+using System.Collections.ObjectModel;
+using System.ComponentModel;
+using System.Runtime.CompilerServices;
+using System.Threading.Tasks;
+using Windows.Graphics;
+using WinUIEx;
+
+namespace OpenClawTray.Windows;
+
+public sealed partial class VoiceRepeaterWindow : WindowEx, IVoiceChatWindow
+{
+ private const int MaxConversationItems = 24;
+ private const int DefaultWidth = 360;
+ private const int DefaultHeight = 170;
+ private const double DefaultTextSize = 13;
+ private const double DefaultCaptionSize = 10;
+
+ private readonly SettingsManager _settings;
+ private readonly IVoiceRuntimeControlApi _voiceRuntimeControlApi;
+ private readonly ObservableCollection<ConversationItem> _conversationItems = [];
+ private readonly DispatcherQueueTimer? _refreshTimer;
+ private readonly DispatcherQueueTimer? _layoutSaveTimer;
+
+ private bool _controlActionInFlight;
+ private bool _suppressSettingsEvents;
+ private bool _autoScrollEnabled;
+ private double _messageFontSize = DefaultTextSize;
+ private double _captionFontSize = DefaultCaptionSize;
+
+ public bool IsClosed { get; private set; }
+
+ public event EventHandler? OpenVoiceStatusRequested;
+
+ public VoiceRepeaterWindow(
+ SettingsManager settings,
+ IVoiceRuntimeControlApi voiceRuntimeControlApi)
+ {
+ _settings = settings;
+ _voiceRuntimeControlApi = voiceRuntimeControlApi;
+ _autoScrollEnabled = _settings.VoiceRepeaterWindow.AutoScroll;
+
+ InitializeComponent();
+
+ Title = "Voice Mode";
+ ApplyStoredWindowPlacement();
+ this.SetIcon(IconHelper.GetStatusIconPath(ConnectionStatus.Connected));
+
+ ConversationItemsControl.ItemsSource = _conversationItems;
+
+ Closed += OnWindowClosed;
+
+ var dispatcherQueue = DispatcherQueue.GetForCurrentThread();
+ if (dispatcherQueue != null)
+ {
+ _refreshTimer = dispatcherQueue.CreateTimer();
+ _refreshTimer.Interval = TimeSpan.FromMilliseconds(400);
+ _refreshTimer.Tick += (_, _) => RefreshStatus();
+ _refreshTimer.Start();
+
+ _layoutSaveTimer = dispatcherQueue.CreateTimer();
+ _layoutSaveTimer.Interval = TimeSpan.FromMilliseconds(600);
+ _layoutSaveTimer.IsRepeating = false;
+ _layoutSaveTimer.Tick += (_, _) =>
+ {
+ _layoutSaveTimer.Stop();
+ SaveWindowPlacement();
+ };
+ }
+
+ if (AppWindow is not null)
+ {
+ AppWindow.Changed += OnAppWindowChanged;
+ }
+
+ ApplyViewSettings();
+ RefreshStatus();
+ UpdateConversationPlaceholder();
+ }
+
+ public void RefreshStatus()
+ {
+ var status = _voiceRuntimeControlApi.CurrentStatus;
+ ApplyStatus(status);
+ }
+
+ public Task UpdateVoiceTranscriptDraftAsync(string text, bool clear)
+ {
+ var draftText = clear ? string.Empty : (text ?? string.Empty);
+ DraftTextBlock.Text = draftText;
+ DraftPanel.Visibility = string.IsNullOrWhiteSpace(draftText)
+ ? Visibility.Collapsed
+ : Visibility.Visible;
+
+ UpdateConversationPlaceholder();
+ ScrollConversationToEnd();
+ return Task.CompletedTask;
+ }
+
+ public Task AppendVoiceConversationTurnAsync(VoiceConversationTurnEventArgs args)
+ {
+ if (args == null || string.IsNullOrWhiteSpace(args.Message))
+ {
+ return Task.CompletedTask;
+ }
+
+ var item = new ConversationItem(
+ args.Direction == VoiceConversationDirection.Outgoing ? "You" : "Assistant",
+ DateTime.Now.ToString("HH:mm:ss"),
+ args.Message,
+ _messageFontSize,
+ _captionFontSize);
+
+ _conversationItems.Add(item);
+ while (_conversationItems.Count > MaxConversationItems)
+ {
+ _conversationItems.RemoveAt(0);
+ }
+
+ UpdateConversationPlaceholder();
+ ScrollConversationToEnd();
+ return Task.CompletedTask;
+ }
+
+ private async void OnPauseResumeClick(object sender, RoutedEventArgs e)
+ {
+ if (_controlActionInFlight)
+ {
+ return;
+ }
+
+ _controlActionInFlight = true;
+ ApplyStatus(_voiceRuntimeControlApi.CurrentStatus);
+
+ try
+ {
+ var status = _voiceRuntimeControlApi.CurrentStatus;
+ if (status.State == VoiceRuntimeState.Paused)
+ {
+ await _voiceRuntimeControlApi.ResumeAsync(new VoiceResumeArgs { Reason = "Voice repeater resume button" });
+ }
+ else
+ {
+ await _voiceRuntimeControlApi.PauseAsync(new VoicePauseArgs { Reason = "Voice repeater pause button" });
+ }
+ }
+ finally
+ {
+ _controlActionInFlight = false;
+ RefreshStatus();
+ }
+ }
+
+ private async void OnSkipReplyClick(object sender, RoutedEventArgs e)
+ {
+ if (_controlActionInFlight || !_voiceRuntimeControlApi.CurrentStatus.CanSkipReply)
+ {
+ return;
+ }
+
+ _controlActionInFlight = true;
+ ApplyStatus(_voiceRuntimeControlApi.CurrentStatus);
+
+ try
+ {
+ await _voiceRuntimeControlApi.SkipCurrentReplyAsync(new VoiceSkipArgs
+ {
+ Reason = "Voice repeater skip button"
+ });
+ }
+ finally
+ {
+ _controlActionInFlight = false;
+ RefreshStatus();
+ }
+ }
+
+ private void OnAutoScrollChanged(object sender, RoutedEventArgs e)
+ {
+ if (_suppressSettingsEvents)
+ {
+ return;
+ }
+
+ _autoScrollEnabled = AutoScrollCheckBox.IsChecked == true;
+ _settings.VoiceRepeaterWindow.AutoScroll = _autoScrollEnabled;
+ _settings.Save(logSuccess: false);
+
+ if (_autoScrollEnabled)
+ {
+ ScrollConversationToEnd();
+ }
+ }
+
+ private void OnTextSizeSelectionChanged(object sender, SelectionChangedEventArgs e)
+ {
+ if (_suppressSettingsEvents || TextSizeComboBox.SelectedItem is not ComboBoxItem item)
+ {
+ return;
+ }
+
+ if (!double.TryParse(item.Tag?.ToString(), out var size))
+ {
+ return;
+ }
+
+ _settings.VoiceRepeaterWindow.TextSize = size;
+ ApplyViewSettings();
+ _settings.Save(logSuccess: false);
+ }
+
+ private void OnOpenVoiceStatusClick(object sender, RoutedEventArgs e)
+ {
+ OpenVoiceStatusRequested?.Invoke(this, EventArgs.Empty);
+ }
+
+ private void OnWindowClosed(object sender, WindowEventArgs e)
+ {
+ if (_refreshTimer != null)
+ {
+ _refreshTimer.Stop();
+ }
+
+ if (_layoutSaveTimer != null)
+ {
+ _layoutSaveTimer.Stop();
+ }
+
+ if (AppWindow is not null)
+ {
+ AppWindow.Changed -= OnAppWindowChanged;
+ }
+
+ SaveWindowPlacement();
+ IsClosed = true;
+ }
+
+ private void OnAppWindowChanged(AppWindow sender, AppWindowChangedEventArgs args)
+ {
+ if (args.DidPositionChange || args.DidSizeChange)
+ {
+ _layoutSaveTimer?.Stop();
+ _layoutSaveTimer?.Start();
+ }
+ }
+
+ private void ApplyStatus(VoiceStatusInfo status)
+ {
+ Title = $"Voice Mode ({GetWindowStateLabel(status)})";
+ DraftCaptionTextBlock.Text = status.State == VoiceRuntimeState.RecordingUtterance
+ ? "You (speaking)"
+ : "You (draft)";
+
+ if (string.IsNullOrWhiteSpace(status.LastError))
+ {
+ TroubleshootingTextBlock.Visibility = Visibility.Collapsed;
+ TroubleshootingTextBlock.Text = string.Empty;
+ }
+ else
+ {
+ TroubleshootingTextBlock.Visibility = Visibility.Visible;
+ TroubleshootingTextBlock.Text = status.LastError;
+ }
+
+ var paused = status.State == VoiceRuntimeState.Paused;
+ PauseResumeButton.IsEnabled = !_controlActionInFlight && status.Mode != VoiceActivationMode.Off;
+ PauseResumeIcon.Glyph = paused ? "\uE768" : "\uE769";
+ ToolTipService.SetToolTip(
+ PauseResumeButton,
+ paused ? "Resume voice mode" : "Pause voice mode");
+
+ SkipReplyButton.IsEnabled = !_controlActionInFlight && status.CanSkipReply;
+ }
+
+ private void ApplyStoredWindowPlacement()
+ {
+ var prefs = _settings.VoiceRepeaterWindow;
+ var width = prefs.Width.GetValueOrDefault(DefaultWidth);
+ var height = prefs.Height.GetValueOrDefault(DefaultHeight);
+
+ this.SetWindowSize(
+ Math.Max(width, 320),
+ Math.Max(height, 150));
+
+ if (prefs.X.HasValue && prefs.Y.HasValue)
+ {
+ AppWindow.Move(new PointInt32(prefs.X.Value, prefs.Y.Value));
+ }
+ else
+ {
+ this.CenterOnScreen();
+ }
+ }
+
+ private void ApplyViewSettings()
+ {
+ _suppressSettingsEvents = true;
+ try
+ {
+ _autoScrollEnabled = _settings.VoiceRepeaterWindow.AutoScroll;
+ _messageFontSize = Math.Clamp(
+ _settings.VoiceRepeaterWindow.TextSize > 0 ? _settings.VoiceRepeaterWindow.TextSize : DefaultTextSize,
+ 11,
+ 15);
+ _captionFontSize = Math.Max(9, _messageFontSize - 3);
+
+ DraftTextBlock.FontSize = _messageFontSize;
+ DraftCaptionTextBlock.FontSize = _captionFontSize;
+ TroubleshootingTextBlock.FontSize = _captionFontSize;
+
+ foreach (var item in _conversationItems)
+ {
+ item.MessageFontSize = _messageFontSize;
+ item.CaptionFontSize = _captionFontSize;
+ }
+
+ AutoScrollCheckBox.IsChecked = _autoScrollEnabled;
+ SelectTextSizeItem(_messageFontSize);
+ }
+ finally
+ {
+ _suppressSettingsEvents = false;
+ }
+ }
+
+ private void SaveWindowPlacement()
+ {
+ if (IsClosed || AppWindow is null)
+ {
+ return;
+ }
+
+ var size = AppWindow.Size;
+ var position = AppWindow.Position;
+ _settings.VoiceRepeaterWindow.Width = size.Width;
+ _settings.VoiceRepeaterWindow.Height = size.Height;
+ _settings.VoiceRepeaterWindow.X = position.X;
+ _settings.VoiceRepeaterWindow.Y = position.Y;
+ _settings.Save(logSuccess: false);
+ }
+
+ private void SelectTextSizeItem(double size)
+ {
+ var sizeTag = ((int)Math.Round(size)).ToString();
+ foreach (var entry in TextSizeComboBox.Items)
+ {
+ if (entry is ComboBoxItem item && string.Equals(item.Tag?.ToString(), sizeTag, StringComparison.Ordinal))
+ {
+ TextSizeComboBox.SelectedItem = item;
+ return;
+ }
+ }
+
+ TextSizeComboBox.SelectedIndex = 2;
+ }
+
+ private void UpdateConversationPlaceholder()
+ {
+ EmptyConversationTextBlock.Visibility = _conversationItems.Count == 0 && DraftPanel.Visibility != Visibility.Visible
+ ? Visibility.Visible
+ : Visibility.Collapsed;
+ }
+
+ private void ScrollConversationToEnd()
+ {
+ if (!_autoScrollEnabled)
+ {
+ return;
+ }
+
+ var dispatcherQueue = DispatcherQueue.GetForCurrentThread();
+ _ = dispatcherQueue?.TryEnqueue(() =>
+ {
+ ConversationScrollViewer.UpdateLayout();
+ ConversationScrollViewer.ChangeView(null, ConversationScrollViewer.ScrollableHeight, null, true);
+ _ = dispatcherQueue.TryEnqueue(() =>
+ ConversationScrollViewer.ChangeView(null, ConversationScrollViewer.ScrollableHeight, null, true));
+ });
+ }
+
+ private static string GetWindowStateLabel(VoiceStatusInfo status)
+ {
+ return status.State switch
+ {
+ VoiceRuntimeState.ListeningForVoiceWake => "listening",
+ VoiceRuntimeState.ListeningContinuously => "listening",
+ VoiceRuntimeState.RecordingUtterance => "hearing you",
+ VoiceRuntimeState.AwaitingResponse => "waiting",
+ VoiceRuntimeState.PlayingResponse => "speaking",
+ VoiceRuntimeState.Paused => "paused",
+ VoiceRuntimeState.Arming => "starting",
+ VoiceRuntimeState.Error => "error",
+ _ when status.Mode == VoiceActivationMode.Off => "off",
+ _ => "idle"
+ };
+ }
+
+ private sealed class ConversationItem : INotifyPropertyChanged
+ {
+ private double _messageFontSize;
+ private double _captionFontSize;
+
+ public ConversationItem(
+ string speaker,
+ string timestamp,
+ string message,
+ double messageFontSize,
+ double captionFontSize)
+ {
+ Speaker = speaker;
+ Timestamp = timestamp;
+ Message = message;
+ _messageFontSize = messageFontSize;
+ _captionFontSize = captionFontSize;
+ }
+
+ public string Speaker { get; }
+ public string Timestamp { get; }
+ public string Message { get; }
+ public string Caption => $"{Speaker} · {Timestamp}";
+
+ public double MessageFontSize
+ {
+ get => _messageFontSize;
+ set
+ {
+ if (Math.Abs(_messageFontSize - value) < 0.01)
+ {
+ return;
+ }
+
+ _messageFontSize = value;
+ OnPropertyChanged();
+ }
+ }
+
+ public double CaptionFontSize
+ {
+ get => _captionFontSize;
+ set
+ {
+ if (Math.Abs(_captionFontSize - value) < 0.01)
+ {
+ return;
+ }
+
+ _captionFontSize = value;
+ OnPropertyChanged();
+ }
+ }
+
+ public event PropertyChangedEventHandler? PropertyChanged;
+
+ private void OnPropertyChanged([CallerMemberName] string? propertyName = null)
+ {
+ PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
+ }
+ }
+}
diff --git a/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
index 0840aec..379991d 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceChatCoordinatorTests.cs
@@ -135,6 +135,32 @@ public async Task AttachWindow_ReplaysBufferedConversationTurns()
Assert.Equal(1, window.TurnCallCount);
}
+ [Fact]
+ public async Task DraftAndTurns_AreBroadcastToAllAttachedWindows()
+ {
+ var runtime = new FakeVoiceRuntime();
+ using var coordinator = new VoiceChatCoordinator(runtime, new ImmediateDispatcher());
+ var firstWindow = new FakeVoiceChatWindow();
+ var secondWindow = new FakeVoiceChatWindow();
+
+ coordinator.AttachWindow(firstWindow);
+ coordinator.AttachWindow(secondWindow);
+
+ runtime.RaiseDraft("shared draft", "main", clear: false);
+ runtime.RaiseConversationTurn(new VoiceConversationTurnEventArgs
+ {
+ Direction = VoiceConversationDirection.Incoming,
+ Message = "shared reply",
+ SessionKey = "main"
+ });
+ await Task.Yield();
+
+ Assert.Equal("shared draft", firstWindow.LastDraftText);
+ Assert.Equal("shared draft", secondWindow.LastDraftText);
+ Assert.Equal("shared reply", firstWindow.LastTurnMessage);
+ Assert.Equal("shared reply", secondWindow.LastTurnMessage);
+ }
+
private sealed class ImmediateDispatcher : IUiDispatcher
{
public bool TryEnqueue(Action callback)
From 5443dc31f3529531b87e6a3b2b26e27c7b122f8d Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Sat, 28 Mar 2026 09:29:31 +0000
Subject: [PATCH 79/83] Refine voice transport and webchat draft bridge
---
.../Services/Voice/VoiceService.cs | 114 ++++++++
.../Windows/WebChatVoiceDomBridge.cs | 112 +++++++
.../Windows/WebChatVoiceDomState.cs | 23 ++
.../Windows/WebChatWindow.xaml.cs | 273 +++---------------
4 files changed, 288 insertions(+), 234 deletions(-)
create mode 100644 src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomBridge.cs
create mode 100644 src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomState.cs
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 350f005..1f85f5e 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -4,6 +4,7 @@
using System.Linq;
using System.Net.Http;
using System.Diagnostics;
+using System.IO;
using System.Runtime.InteropServices.WindowsRuntime;
using System.Text.Json;
using System.Threading;
@@ -525,6 +526,7 @@ private async Task StartTalkModeRuntimeAsync(VoiceSettings settings, string? ses
synthesizer = new SpeechSynthesizer();
player = new MediaPlayer();
await ConfigurePlaybackOutputDeviceAsync(player, settings);
+ await WarmSpeechPlaybackPipelineAsync(player, synthesizer, selectedTextToSpeech, runtimeCts.Token);
if (recognizer != null)
{
@@ -1440,6 +1442,35 @@ private async Task SpeakTextAsync(string text, CancellationToken cancellationTok
await PlayStreamAsync(player, stream, stream.ContentType, cancellationToken);
}
+ private async Task WarmSpeechPlaybackPipelineAsync(
+ MediaPlayer player,
+ SpeechSynthesizer? synthesizer,
+ VoiceProviderOption provider,
+ CancellationToken cancellationToken)
+ {
+ var stopwatch = Stopwatch.StartNew();
+
+ try
+ {
+ using var silentStream = CreateSilentWaveStream();
+ await PreloadStreamAsync(player, silentStream, "audio/wav", cancellationToken);
+
+ if (!UsesCloudTextToSpeechRuntime(provider) && synthesizer != null)
+ {
+ using var warmupStream = await synthesizer.SynthesizeTextToStreamAsync(" ");
+ }
+
+ _logger.Info($"Voice playback warm-up completed: total={stopwatch.ElapsedMilliseconds}ms");
+ }
+ catch (OperationCanceledException)
+ {
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice playback warm-up failed: {ex.Message}");
+ }
+ }
+
private static bool UsesCloudTextToSpeechRuntime(VoiceProviderOption provider)
{
return provider.TextToSpeechHttp != null || provider.TextToSpeechWebSocket != null;
@@ -1661,6 +1692,89 @@ private static string CreateReplyPreview(string text)
return $"{trimmed[..117]}...";
}
+ private static InMemoryRandomAccessStream CreateSilentWaveStream()
+ {
+ const int sampleRate = 16000;
+ const short bitsPerSample = 16;
+ const short channels = 1;
+ const int durationMs = 120;
+
+ var bytesPerSample = bitsPerSample / 8;
+ var sampleCount = sampleRate * durationMs / 1000;
+ var dataSize = sampleCount * channels * bytesPerSample;
+ var byteRate = sampleRate * channels * bytesPerSample;
+ var blockAlign = (short)(channels * bytesPerSample);
+
+ var buffer = new byte[44 + dataSize];
+ using var writer = new BinaryWriter(new MemoryStream(buffer, writable: true));
+
+ writer.Write(System.Text.Encoding.ASCII.GetBytes("RIFF"));
+ writer.Write(36 + dataSize);
+ writer.Write(System.Text.Encoding.ASCII.GetBytes("WAVE"));
+ writer.Write(System.Text.Encoding.ASCII.GetBytes("fmt "));
+ writer.Write(16);
+ writer.Write((short)1);
+ writer.Write(channels);
+ writer.Write(sampleRate);
+ writer.Write(byteRate);
+ writer.Write(blockAlign);
+ writer.Write(bitsPerSample);
+ writer.Write(System.Text.Encoding.ASCII.GetBytes("data"));
+ writer.Write(dataSize);
+
+ var stream = new InMemoryRandomAccessStream();
+ using (var output = stream.AsStreamForWrite())
+ {
+ output.Write(buffer, 0, buffer.Length);
+ output.Flush();
+ }
+
+ stream.Seek(0);
+ return stream;
+ }
+
+ private static async Task PreloadStreamAsync(
+ MediaPlayer player,
+ IRandomAccessStream stream,
+ string contentType,
+ CancellationToken cancellationToken)
+ {
+ stream.Seek(0);
+ var mediaOpened = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+
+ TypedEventHandler<MediaPlayer, object>? openedHandler = null;
+ TypedEventHandler<MediaPlayer, MediaPlayerFailedEventArgs>? failedHandler = null;
+
+ openedHandler = (sender, _) => mediaOpened.TrySetResult(true);
+ failedHandler = (sender, args) =>
+ {
+ var errorMessage = string.IsNullOrWhiteSpace(args.ErrorMessage)
+ ? "Media preload failed."
+ : args.ErrorMessage;
+ mediaOpened.TrySetException(new InvalidOperationException(errorMessage));
+ };
+
+ player.MediaOpened += openedHandler;
+ player.MediaFailed += failedHandler;
+ using var registration = cancellationToken.Register(() =>
+ {
+ try { player.Source = null; } catch { }
+ mediaOpened.TrySetCanceled(cancellationToken);
+ });
+
+ try
+ {
+ player.Source = MediaSource.CreateFromStream(stream, contentType);
+ await mediaOpened.Task;
+ }
+ finally
+ {
+ player.MediaOpened -= openedHandler;
+ player.MediaFailed -= failedHandler;
+ player.Source = null;
+ }
+ }
+
private static async Task PlayStreamAsync(
MediaPlayer player,
IRandomAccessStream stream,
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomBridge.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomBridge.cs
new file mode 100644
index 0000000..b331cb3
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomBridge.cs
@@ -0,0 +1,112 @@
+using System.Text.Json;
+
+namespace OpenClawTray.Windows;
+
+internal static class WebChatVoiceDomBridge
+{
+ public const string DocumentCreatedScript = """
+(() => {
+ const isVisible = (el) => !!el && !(el.disabled === true) && el.getClientRects().length > 0;
+ const memoryPattern = /<relevant-memories>[\s\S]*?<\/relevant-memories>\s*/gi;
+ let desiredDraft = '';
+ let stripInjectedMemories = true;
+
+ const sanitize = (value) => {
+ const text = typeof value === 'string' ? value : '';
+ return stripInjectedMemories ? text.replace(memoryPattern, '').trimStart() : text;
+ };
+
+ const findComposer = () => {
+ const candidates = Array.from(document.querySelectorAll('textarea, input[type="text"], [contenteditable="true"], [contenteditable="plaintext-only"]'));
+ return candidates.find(isVisible) || null;
+ };
+
+ const setElementValue = (el, value) => {
+ const sanitized = sanitize(value);
+ if ('value' in el) {
+ const proto = el.tagName === 'TEXTAREA' ? HTMLTextAreaElement.prototype : HTMLInputElement.prototype;
+ const descriptor = Object.getOwnPropertyDescriptor(proto, 'value');
+ if (descriptor && descriptor.set) {
+ descriptor.set.call(el, sanitized);
+ } else {
+ el.value = sanitized;
+ }
+ el.dispatchEvent(new InputEvent('input', { bubbles: true, data: sanitized, inputType: 'insertText' }));
+ el.dispatchEvent(new Event('change', { bubbles: true }));
+ return;
+ }
+
+ if (el.isContentEditable) {
+ el.textContent = sanitized;
+ el.dispatchEvent(new InputEvent('input', { bubbles: true, data: sanitized, inputType: 'insertText' }));
+ el.dispatchEvent(new Event('change', { bubbles: true }));
+ }
+ };
+
+ const applyDraftIfPossible = () => {
+ const composer = findComposer();
+ if (!composer) return false;
+ setElementValue(composer, desiredDraft);
+ return true;
+ };
+
+ const clearLegacyTurnsHost = () => {
+ const host = document.getElementById('openclaw-tray-voice-turns');
+ if (host) {
+ host.remove();
+ }
+ };
+
+ const observer = new MutationObserver(() => applyDraftIfPossible());
+ const start = () => {
+ if (!document.body) return;
+ observer.observe(document.body, { childList: true, subtree: true });
+ applyDraftIfPossible();
+ clearLegacyTurnsHost();
+ };
+
+ if (document.readyState === 'loading') {
+ document.addEventListener('DOMContentLoaded', start, { once: true });
+ } else {
+ start();
+ }
+
+ window.__openClawTrayVoice = {
+ setDraft(text) {
+ desiredDraft = text || '';
+ return applyDraftIfPossible();
+ },
+ setStripInjectedMemories(enabled) {
+ stripInjectedMemories = !!enabled;
+ return applyDraftIfPossible();
+ },
+ clearDraft() {
+ desiredDraft = '';
+ return applyDraftIfPossible();
+ },
+ setTurns() {
+ clearLegacyTurnsHost();
+ return true;
+ }
+ };
+})();
+""";
+
+ public static string BuildSetStripInjectedMemoriesScript(bool enabled)
+ {
+ var value = enabled ? "true" : "false";
+ return $"window.__openClawTrayVoice?.setStripInjectedMemories?.({value});";
+ }
+
+ public static string BuildSetDraftScript(string? text)
+ {
+ if (string.IsNullOrWhiteSpace(text))
+ {
+ return "window.__openClawTrayVoice?.clearDraft?.();";
+ }
+
+ return $"window.__openClawTrayVoice?.setDraft?.({JsonSerializer.Serialize(text)});";
+ }
+
+ public const string ClearLegacyTurnsScript = "window.__openClawTrayVoice?.setTurns?.([]);";
+}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomState.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomState.cs
new file mode 100644
index 0000000..e8f77f6
--- /dev/null
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomState.cs
@@ -0,0 +1,23 @@
+namespace OpenClawTray.Windows;
+
+internal sealed class WebChatVoiceDomState
+{
+ public WebChatVoiceDomState(bool stripInjectedMemories)
+ {
+ StripInjectedMemories = stripInjectedMemories;
+ }
+
+ public bool StripInjectedMemories { get; private set; }
+
+ public string PendingDraft { get; private set; } = string.Empty;
+
+ public void SetDraft(string? text, bool clear)
+ {
+ PendingDraft = clear ? string.Empty : (text ?? string.Empty);
+ }
+
+ public void SetStripInjectedMemories(bool enabled)
+ {
+ StripInjectedMemories = enabled;
+ }
+}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index ad863c8..93e62fc 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -5,11 +5,9 @@
using OpenClawTray.Services;
using OpenClawTray.Services.Voice;
using System;
-using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
-using System.Text.Json;
using System.Threading.Tasks;
using WinUIEx;
using Windows.Foundation;
@@ -21,199 +19,31 @@ public sealed partial class WebChatWindow : WindowEx
{
private readonly string _gatewayUrl;
private readonly string _token;
- private bool _stripInjectedMemories;
- private string _pendingVoiceDraft = string.Empty;
- private readonly List<VoiceConversationTurnMirror> _pendingVoiceTurns = [];
-
- // Store event handlers for cleanup
+ private readonly WebChatVoiceDomState _voiceDomState;
+ private bool _voiceDomReady;
+
private TypedEventHandler<CoreWebView2, CoreWebView2NavigationCompletedEventArgs>? _navigationCompletedHandler;
private TypedEventHandler<CoreWebView2, CoreWebView2NavigationStartingEventArgs>? _navigationStartingHandler;
-
- public bool IsClosed { get; private set; }
- private sealed record VoiceConversationTurnMirror(string Direction, string Text);
-
-private const string TrayVoiceIntegrationScript = """
-(() => {
- const isVisible = (el) => !!el && !(el.disabled === true) && el.getClientRects().length > 0;
- const memoryPattern = /<relevant-memories>[\s\S]*?<\/relevant-memories>\s*/gi;
- let desiredDraft = '';
- let stripInjectedMemories = true;
- const findComposer = () => {
- const candidates = Array.from(document.querySelectorAll('textarea, input[type="text"], [contenteditable="true"], [contenteditable="plaintext-only"]'));
- return candidates.find(isVisible) || null;
- };
- const setElementValue = (el, value) => {
- if ('value' in el) {
- const proto = el.tagName === 'TEXTAREA' ? HTMLTextAreaElement.prototype : HTMLInputElement.prototype;
- const descriptor = Object.getOwnPropertyDescriptor(proto, 'value');
- if (descriptor && descriptor.set) {
- descriptor.set.call(el, value);
- } else {
- el.value = value;
- }
- el.dispatchEvent(new InputEvent('input', { bubbles: true, data: value, inputType: 'insertText' }));
- el.dispatchEvent(new Event('change', { bubbles: true }));
- return;
- }
- if (el.isContentEditable) {
- el.textContent = value;
- el.dispatchEvent(new InputEvent('input', { bubbles: true, data: value, inputType: 'insertText' }));
- el.dispatchEvent(new Event('change', { bubbles: true }));
- }
- };
- let desiredTurns = [];
- const ensureTurnsHost = () => {
- if (!document.body) return null;
- let host = document.getElementById('openclaw-tray-voice-turns');
- if (host) return host;
- host = document.createElement('div');
- host.id = 'openclaw-tray-voice-turns';
- Object.assign(host.style, {
- position: 'fixed',
- left: '16px',
- right: '16px',
- bottom: '88px',
- zIndex: '2147483000',
- display: 'flex',
- flexDirection: 'column',
- gap: '8px',
- pointerEvents: 'none',
- alignItems: 'stretch'
- });
- document.body.appendChild(host);
- return host;
- };
- const renderTurns = () => {
- const host = ensureTurnsHost();
- if (!host) return false;
- host.innerHTML = '';
- const items = Array.isArray(desiredTurns) ? desiredTurns : [];
- if (items.length === 0) {
- host.style.display = 'none';
- return true;
- }
- host.style.display = 'flex';
- for (const item of items) {
- if (!item || !item.text) continue;
- const row = document.createElement('div');
- Object.assign(row.style, {
- display: 'flex',
- justifyContent: item.direction === 'incoming' ? 'flex-start' : 'flex-end'
- });
- const bubble = document.createElement('div');
- bubble.textContent = item.text;
- Object.assign(bubble.style, {
- maxWidth: 'min(70vw, 720px)',
- padding: '10px 14px',
- borderRadius: '16px',
- boxShadow: '0 8px 20px rgba(15, 23, 42, 0.12)',
- border: item.direction === 'incoming'
- ? '1px solid rgba(148, 163, 184, 0.35)'
- : '1px solid rgba(59, 130, 246, 0.35)',
- background: item.direction === 'incoming'
- ? 'rgba(255, 255, 255, 0.94)'
- : 'rgba(219, 234, 254, 0.96)',
- color: '#0f172a',
- font: '500 14px/1.4 \"Segoe UI\", sans-serif',
- whiteSpace: 'pre-wrap'
- });
- row.appendChild(bubble);
- host.appendChild(row);
- }
- return true;
- };
- const applyDraftIfPossible = () => {
- const composer = findComposer();
- if (!composer) return false;
- setElementValue(composer, desiredDraft);
- return true;
- };
- const cleanTextNodes = () => {
- if (!stripInjectedMemories || !document.body) return false;
- const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
- const nodes = [];
- let current;
- while ((current = walker.nextNode())) {
- nodes.push(current);
- }
- let changed = false;
- for (const node of nodes) {
- if (!node || !node.parentElement) continue;
- const tag = node.parentElement.tagName;
- if (tag === 'SCRIPT' || tag === 'STYLE' || tag === 'TEXTAREA') continue;
- const original = node.textContent || '';
- const withoutMemories = original.replace(memoryPattern, '');
- if (withoutMemories !== original) {
- const cleaned = withoutMemories.trimStart();
- node.textContent = cleaned;
- changed = true;
- }
- }
- return changed;
- };
- let refreshScheduled = false;
- const refreshView = () => {
- if (refreshScheduled) return;
- refreshScheduled = true;
- queueMicrotask(() => {
- refreshScheduled = false;
- cleanTextNodes();
- applyDraftIfPossible();
- renderTurns();
- });
- };
- const observer = new MutationObserver(() => refreshView());
- const start = () => {
- if (!document.body) return;
- observer.observe(document.body, { childList: true, subtree: true });
- refreshView();
- };
- if (document.readyState === 'loading') {
- document.addEventListener('DOMContentLoaded', start, { once: true });
- } else {
- start();
- }
- window.__openClawTrayVoice = {
- setDraft(text) {
- desiredDraft = text || '';
- return applyDraftIfPossible();
- },
- setStripInjectedMemories(enabled) {
- stripInjectedMemories = !!enabled;
- refreshView();
- return true;
- },
- setTurns(turns) {
- desiredTurns = Array.isArray(turns) ? turns : [];
- return renderTurns();
- },
- clearDraft() {
- desiredDraft = '';
- return applyDraftIfPossible();
- }
- };
-})();
-""";
+ public bool IsClosed { get; private set; }
public WebChatWindow(string gatewayUrl, string token, bool stripInjectedMemories)
{
Logger.Info($"WebChatWindow: Constructor called, gateway={gatewayUrl}");
_gatewayUrl = gatewayUrl;
_token = token;
- _stripInjectedMemories = stripInjectedMemories;
-
+ _voiceDomState = new WebChatVoiceDomState(stripInjectedMemories);
+
InitializeComponent();
-
- // Window configuration
+
this.SetWindowSize(520, 750);
this.MinWidth = 380;
this.MinHeight = 450;
this.CenterOnScreen();
this.SetIcon(IconHelper.GetStatusIconPath(ConnectionStatus.Connected));
-
+
Closed += OnWindowClosed;
-
+
Logger.Info("WebChatWindow: Starting InitializeWebViewAsync");
_ = InitializeWebViewAsync();
}
@@ -221,8 +51,8 @@ public WebChatWindow(string gatewayUrl, string token, bool stripInjectedMemories
private void OnWindowClosed(object sender, WindowEventArgs e)
{
IsClosed = true;
-
- // Cleanup WebView2 event handlers
+ _voiceDomReady = false;
+
if (WebView.CoreWebView2 != null)
{
if (_navigationCompletedHandler != null)
@@ -237,37 +67,39 @@ private async Task InitializeWebViewAsync()
try
{
Logger.Info("WebChatWindow: Initializing WebView2...");
-
- // Set up user data folder for WebView2
+
var userDataFolder = Path.Combine(
Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
"OpenClawTray", "WebView2");
-
+
Directory.CreateDirectory(userDataFolder);
Logger.Info($"WebChatWindow: User data folder: {userDataFolder}");
- // Set environment variable for user data folder
Environment.SetEnvironmentVariable("WEBVIEW2_USER_DATA_FOLDER", userDataFolder);
-
+
Logger.Info("WebChatWindow: Calling EnsureCoreWebView2Async...");
await WebView.EnsureCoreWebView2Async();
Logger.Info("WebChatWindow: CoreWebView2 initialized successfully");
-
- // Configure WebView2
+
WebView.CoreWebView2.Settings.IsStatusBarEnabled = false;
WebView.CoreWebView2.Settings.AreDefaultContextMenusEnabled = true;
WebView.CoreWebView2.Settings.IsZoomControlEnabled = true;
- await WebView.CoreWebView2.AddScriptToExecuteOnDocumentCreatedAsync(TrayVoiceIntegrationScript);
+ await WebView.CoreWebView2.AddScriptToExecuteOnDocumentCreatedAsync(WebChatVoiceDomBridge.DocumentCreatedScript);
+
+ _voiceDomReady = false;
- // Handle navigation events (store for cleanup)
_navigationCompletedHandler = (s, e) =>
{
Logger.Info($"WebChatWindow: Navigation completed, success={e.IsSuccess}, status={e.WebErrorStatus}");
LoadingRing.IsActive = false;
LoadingRing.Visibility = Visibility.Collapsed;
- _ = RefreshTrayVoiceDomStateAsync();
-
- // Show friendly error if connection failed
+ _voiceDomReady = e.IsSuccess;
+
+ if (e.IsSuccess)
+ {
+ _ = RefreshTrayVoiceDomStateAsync();
+ }
+
if (!e.IsSuccess && (e.WebErrorStatus == CoreWebView2WebErrorStatus.ConnectionAborted ||
e.WebErrorStatus == CoreWebView2WebErrorStatus.CannotConnect ||
e.WebErrorStatus == CoreWebView2WebErrorStatus.ConnectionReset ||
@@ -290,15 +122,14 @@ private async Task InitializeWebViewAsync()
_navigationStartingHandler = (s, e) =>
{
- // Strip query params to avoid logging tokens
var safeUri = e.Uri?.Split('?')[0] ?? "unknown";
Logger.Info($"WebChatWindow: Navigation starting to {safeUri}");
+ _voiceDomReady = false;
LoadingRing.IsActive = true;
LoadingRing.Visibility = Visibility.Visible;
};
WebView.CoreWebView2.NavigationStarting += _navigationStartingHandler;
- // Navigate to chat
NavigateToChat();
}
catch (Exception ex)
@@ -310,13 +141,12 @@ private async Task InitializeWebViewAsync()
Logger.Error($"WebView2 inner exception: {ex.InnerException.GetType().FullName}: {ex.InnerException.Message}");
}
Logger.Error($"WebView2 stack trace: {ex.StackTrace}");
-
- // Show error in the dialog instead of falling back to browser
+
LoadingRing.IsActive = false;
LoadingRing.Visibility = Visibility.Collapsed;
WebView.Visibility = Visibility.Collapsed;
ErrorPanel.Visibility = Visibility.Visible;
-
+
var errorDetails = $"Exception: {ex.GetType().FullName}\n" +
$"HResult: 0x{ex.HResult:X8}\n" +
$"Message: {ex.Message}\n\n" +
@@ -324,17 +154,16 @@ private async Task InitializeWebViewAsync()
$"Architecture: {RuntimeInformation.ProcessArchitecture}\n" +
$"OS: {RuntimeInformation.OSDescription}\n\n" +
$"Stack Trace:\n{ex.StackTrace}";
-
+
if (ex.InnerException != null)
{
errorDetails += $"\n\nInner Exception: {ex.InnerException.GetType().FullName}\n{ex.InnerException.Message}";
}
-
+
ErrorText.Text = errorDetails;
}
}
- // Set to a test URL to bypass gateway (e.g., "https://www.bing.com"), or null for normal operation
private const string? DEBUG_TEST_URL = null;
private static bool IsLocalHost(Uri uri)
@@ -383,12 +212,11 @@ private void ShowErrorMessage(string message)
ErrorPanel.Visibility = Visibility.Visible;
ErrorText.Text = message;
}
-
+
private void NavigateToChat()
{
if (WebView.CoreWebView2 == null) return;
- // If debug URL is set, use it instead of gateway
if (!string.IsNullOrEmpty(DEBUG_TEST_URL))
{
Logger.Info($"WebChatWindow: DEBUG MODE - Navigating to test URL: {DEBUG_TEST_URL}");
@@ -426,7 +254,7 @@ private void OnPopout(object sender, RoutedEventArgs e)
ShowErrorMessage(errorMessage);
return;
}
-
+
try
{
Process.Start(new ProcessStartInfo(url) { UseShellExecute = true });
@@ -444,58 +272,35 @@ private void OnDevTools(object sender, RoutedEventArgs e)
public async Task UpdateVoiceTranscriptDraftAsync(string text, bool clear)
{
- _pendingVoiceDraft = clear ? string.Empty : (text ?? string.Empty);
+ _voiceDomState.SetDraft(text, clear);
await RefreshTrayVoiceDomStateAsync();
}
public async Task AppendVoiceConversationTurnAsync(VoiceConversationTurnEventArgs args)
{
- ArgumentNullException.ThrowIfNull(args);
-
- if (args.Direction != VoiceConversationDirection.Outgoing ||
- string.IsNullOrWhiteSpace(args.Message))
- {
- return;
- }
-
- _pendingVoiceTurns.Add(new VoiceConversationTurnMirror("outgoing", args.Message.Trim()));
- if (_pendingVoiceTurns.Count > 6)
- {
- _pendingVoiceTurns.RemoveAt(0);
- }
-
- await RefreshTrayVoiceDomStateAsync();
+ await Task.CompletedTask;
}
public async Task SetStripInjectedMemoriesEnabledAsync(bool enabled)
{
- _stripInjectedMemories = enabled;
+ _voiceDomState.SetStripInjectedMemories(enabled);
await RefreshTrayVoiceDomStateAsync();
}
private async Task RefreshTrayVoiceDomStateAsync()
{
- if (WebView.CoreWebView2 == null)
+ if (WebView.CoreWebView2 == null || !_voiceDomReady || IsClosed)
{
return;
}
try
{
- var stripJson = _stripInjectedMemories ? "true" : "false";
await WebView.CoreWebView2.ExecuteScriptAsync(
- $"window.__openClawTrayVoice?.setStripInjectedMemories?.({stripJson});");
-
- var draftJson = JsonSerializer.Serialize(_pendingVoiceDraft ?? string.Empty);
- var script = string.IsNullOrWhiteSpace(_pendingVoiceDraft)
- ? "window.__openClawTrayVoice?.clearDraft?.();"
- : $"window.__openClawTrayVoice?.setDraft?.({draftJson});";
-
- await WebView.CoreWebView2.ExecuteScriptAsync(script);
-
- var turnsJson = JsonSerializer.Serialize(_pendingVoiceTurns);
+ WebChatVoiceDomBridge.BuildSetStripInjectedMemoriesScript(_voiceDomState.StripInjectedMemories));
await WebView.CoreWebView2.ExecuteScriptAsync(
- $"window.__openClawTrayVoice?.setTurns?.({turnsJson});");
+ WebChatVoiceDomBridge.BuildSetDraftScript(_voiceDomState.PendingDraft));
+ await WebView.CoreWebView2.ExecuteScriptAsync(WebChatVoiceDomBridge.ClearLegacyTurnsScript);
}
catch (Exception ex)
{
From dc8651c1c69c994cf57a0f0aca8fcfed798783c2 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 30 Mar 2026 11:15:56 +0100
Subject: [PATCH 80/83] Refine voice mode controls and docs
---
README.md | 12 +-
docs/VOICE-MODE.md | 355 ++++--------------
src/OpenClaw.Shared/SettingsData.cs | 2 +
src/OpenClaw.Shared/VoiceModeSchema.cs | 4 +-
src/OpenClaw.Tray.WinUI/App.xaml.cs | 28 +-
.../Controls/VoiceSettingsPanel.xaml | 5 +-
.../Controls/VoiceSettingsPanel.xaml.cs | 13 +-
.../Services/Voice/VoiceService.cs | 226 +++++++----
.../Windows/VoiceRepeaterWindow.xaml | 26 +-
.../Windows/VoiceRepeaterWindow.xaml.cs | 120 +++++-
.../Windows/WebChatVoiceDomBridge.cs | 31 +-
.../Windows/WebChatVoiceDomState.cs | 10 +-
.../Windows/WebChatWindow.xaml.cs | 12 +-
13 files changed, 421 insertions(+), 423 deletions(-)
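
Reviewer note (illustrative, not part of the change): `BuildSetDraftScript` in WebChatVoiceDomBridge.cs serializes the transcript as JSON before splicing it into the script that is passed to `ExecuteScriptAsync`. A minimal JavaScript sketch of why that matters follows. The names `buildSetDraftScript` and `runAgainstFakeBridge` are hypothetical, and `JSON.stringify` stands in for `JsonSerializer.Serialize`; the two serializers do not escape exactly the same set of characters, so treat this as an approximation of the C# helper rather than a port.

```javascript
// Hypothetical JavaScript counterpart of BuildSetDraftScript, for illustration only.
function buildSetDraftScript(text) {
  // Blank or whitespace-only drafts map to clearDraft(), as in the C# helper.
  if (text == null || text.trim() === '') {
    return 'window.__openClawTrayVoice?.clearDraft?.();';
  }
  // JSON.stringify emits a quoted, escaped literal, so quotes and newlines
  // in the transcript cannot terminate the string or inject statements.
  return `window.__openClawTrayVoice?.setDraft?.(${JSON.stringify(text)});`;
}

// Minimal stand-in for the page-side bridge so the script can run outside a browser.
function runAgainstFakeBridge(script) {
  const calls = [];
  const window = {
    __openClawTrayVoice: {
      setDraft: (t) => { calls.push(['setDraft', t]); return true; },
      clearDraft: () => { calls.push(['clearDraft']); return true; },
    },
  };
  // new Function evaluates the script with `window` bound to the fake object.
  new Function('window', script)(window);
  return calls;
}
```

Running a transcript that contains a quote and a closing parenthesis through `runAgainstFakeBridge(buildSetDraftScript(...))` should record a single `setDraft` call carrying the original text unchanged, which is the property the serialization step is there to provide.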
diff --git a/README.md b/README.md
index b0c3e40..9ea5ef5 100644
--- a/README.md
+++ b/README.md
@@ -84,16 +84,18 @@ Modern Windows 11-style system tray companion that connects to your local OpenCl
- 🚀 **Auto-start** - Launch with Windows
- ⚙️ **Settings** - Full configuration dialog
- 🎯 **First-run experience** - Welcome dialog guides new users
+- 🦞🎧 **Voice Mode (new)** - Talk to your Claw via your Windows node
### Menu Sections
- **Status** - Gateway connection status with click-to-view details
+- **Voice** - Access to Voice controls
- **Sessions** - Active agent sessions with preview and per-session controls
- **Usage** - Provider/cost summary with quick jump to activity details
- **Channels** - Telegram/WhatsApp status with toggle control
- **Nodes** - Online/offline node inventory and copyable summary
- **Recent Activity** - Timestamped event stream for sessions, usage, nodes, and notifications
- **Actions** - Dashboard, Web Chat, Quick Send, Activity Stream, History
-- **Settings** - Configuration, auto-start, logs
+- **Settings** - Configuration, auto-start, logs, voice
### Mac Parity Status
@@ -246,6 +248,14 @@ OpenClaw registers the `openclaw://` URL scheme for automation and integration:
Deep links work even when Molty is already running - they're forwarded via IPC.
+### Voice Mode
+*built by NichUK and his colleagues @codex and @copilot*
+
+Currently supports Talk Mode - Always on talk to your Claw! Wakeword and PTT modes coming soon
+- Uses internal Windows STT (cloud providers coming soon)
+- Windows/Minimax/Eleven Labs TTS voices
+ - Give your Claw a voice!
+
## 📦 OpenClaw.CommandPalette
PowerToys Command Palette extension for quick OpenClaw access.
diff --git a/docs/VOICE-MODE.md b/docs/VOICE-MODE.md
index 77c6bb2..87d6011 100644
--- a/docs/VOICE-MODE.md
+++ b/docs/VOICE-MODE.md
@@ -1,16 +1,19 @@
# Voice Mode Architecture
+*Author: Nich Overend (NichUK@GitHub) - with @codex and @copilot*
+https://github.com/openclaw/openclaw-windows-node
+
This document defines the voice subsystem for the Windows node only. It introduces the command surface, persisted settings schema, and minimum runtime boundaries needed to add Windows voice support without reshaping the existing node architecture.
## Goals
- Add a node-local voice mode with two activation modes: `VoiceWake` and `TalkMode`
-- Utilise minimal touch points to the existing app to reduce the potential for screw-ups.
+- Utilise minimal touch points to the existing app to reduce the potential for screw-ups
- Use NanoWakeWord for wakeword detection on-device
- Present the user-facing mode names as `Voice Wake` and `Talk Mode`
-- Keep STT/TTS provider selection configurable, with Windows implementations as the default built-ins
+- Keep STT/TTS provider selection configurable, with Windows implementations as the default built-in baseline
- Implement `MiniMax` TTS and `ElevenLabs` TTS as required non-Windows providers after the Windows baseline
-- Make adding new voice providers an update to a Json catalog, rather than requiring code changes
+- Where possible, make adding new voice providers an update to a JSON catalog rather than requiring code changes
- Reuse the existing node capability pattern instead of introducing a parallel control path
- Ensure that the voice sub-system is extensible
- Ensure that the voice sub-system is controllable from other applications
@@ -56,18 +59,16 @@ The contracts and persisted settings now use `VoiceWake` and `TalkMode` as well.
- if speech activity ends without any usable final transcript surviving, Talk Mode now clears the draft and gives a short local repeat prompt instead of silently doing nothing
- the compact voice repeater window, when open, shows the live transcript draft plus local sent/received turns in a single scrolling surface
- the tray chat window, when open, mirrors the live transcript draft into the compose box only
-- the finalized transcript is always sent to OpenClaw via direct `chat.send` on the main session
+- the finalized transcript is always sent to OpenClaw via direct `chat.send` on the voice mode target session, which is currently hardcoded in the tray app to `agent:main:main`
- OpenClaw returns the assistant reply as normal chat output
- the node performs local or remote TTS playback of that reply
- assistant replies are queued locally and spoken sequentially, with a short (500 ms currently) pause between queued replies so overlapping responses are not lost
-- if a reply arrives after the normal 45-second wait timeout, the tray still accepts and speaks that late reply for a short bounded grace window so slow upstream responses are not silently lost
-- the tray chat window can optionally strip injected `<relevant-memories>...</relevant-memories>` blocks from mirrored draft text before that draft is injected into the compose box
+- if a reply arrives after the normal 45-second wait timeout, the tray still accepts and speaks that late reply for a short bounded grace window (currently 120s) so slow upstream responses are not silently lost
+- assistant replies are currently accepted from either `agent:main:main` or the `main` alias so the tray can tolerate upstream session-key normalisation differences
To avoid obvious duplicate sends from the Windows recognizer, exact duplicate final transcripts are suppressed within a short 750 ms window.
-That means the first Windows target is transcript transport, not raw audio upload. Streaming audio frames in or out of OpenClaw remains a future protocol extension, and therefore not part of this design.
-
-The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That sidecar connection carries assistant chat events for `TalkMode`, while the recognized transcript is always sent through the tray app's direct `chat.send` path.
+The current Windows implementation uses a voice-local operator connection inside the tray app while node mode is active. That connection carries assistant chat events for `TalkMode`, while the recognized transcript is always sent through the tray app's direct `chat.send` path.
## Voice APIs
@@ -88,13 +89,13 @@ The node capability command surface is:
- `voice.stop`
- `voice.pause`
- `voice.resume`
-- `voice.skip`
+- `voice.response.skip`
These commands are defined in [VoiceModeSchema.cs](../src/OpenClaw.Shared/VoiceModeSchema.cs) and handled by [VoiceCapability.cs](../src/OpenClaw.Shared/Capabilities/VoiceCapability.cs).
`voice.settings.get` / `voice.settings.set` are the configuration API.
-`voice.start` / `voice.stop` / `voice.pause` / `voice.resume` / `voice.skip` are the runtime control API.
+`voice.start` / `voice.stop` / `voice.pause` / `voice.resume` / `voice.response.skip` are the runtime control API.
### Status Surface
@@ -142,92 +143,35 @@ The current tray implementation now uses the voice configuration interface for:
That means the settings UI is no longer hard-wired only to concrete `VoiceService` internals for its voice-specific behavior.
-## Speech Output Latency
-
-Microsoft's Azure Speech SDK latency guidance is specifically about speech synthesis, not speech recognition, so it applies to Windows voice output rather than voice input. Source: [Lower speech synthesis latency using Speech SDK](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-lower-speech-synthesis-latency?pivots=programming-language-csharp).
+## Speech Output Implementation
-The current Windows implementation already follows the guidance where it maps cleanly:
+To reduce output latency as much as possible, the current Windows implementation makes the following decisions:
- the Windows `SpeechSynthesizer` is created once per `TalkMode` runtime and reused for subsequent replies
+ - Frankly, no one will probably use it, but everyone has it, so...
- cloud TTS uses a shared static `HttpClient`, so HTTP/TLS connections can be reused across replies
- cloud requests use `ResponseHeadersRead`, which lets the client observe response-header arrival without waiting for full buffering first
- the tray app now logs per-reply synthesis timings for both Windows and cloud TTS paths so latency can be measured directly during testing
-The main remaining gap is streaming playback from the first audio chunk. The Azure guidance recommends chunked playback as soon as the first audio arrives, but the current Windows implementation still waits for a complete playable stream before starting output:
+The main remaining gap is streaming playback from the first audio chunk. Best practice recommends chunked playback as soon as the first audio arrives, but the current implementation still waits for a complete playable stream before starting output (but not for long...):
- Windows `SpeechSynthesizer` is used through `SynthesizeTextToStreamAsync`, which returns a complete stream for playback
- MiniMax now uses the provider catalog's WebSocket TTS contract, but the current player still waits for a complete playable stream before output starts
- ElevenLabs now uses the provider catalog's `stream-input` WebSocket contract, but the current player still waits for a complete playable stream before output starts
-So the current design minimizes avoidable setup and connection latency, but does not yet implement first-chunk playback streaming. This is, however, planned for an early release.
+So the current design minimizes avoidable setup and connection latency, but does not yet implement first-chunk playback streaming. This is, however, planned for an early release (I'm working on it next).
## Tray Chat Integration Decision
-Voice mode and typed chat must remain part of the same user-visible conversation in the tray app. Creating a separate "voice session" would reduce implementation complexity, but it would make the chat experience harder to understand:
-
-- voice utterances would not appear in the same tray chat history as typed messages
-- the user would need to reason about two concurrent sessions for one tray app
-- voice replies and typed replies could diverge across windows
-
-### Problem Encountered
+Ideally, voice mode and typed chat should remain part of the same user-visible conversation in the web chat UI. However, this proved difficult to achieve, as the gateway treated a message stream from the tray app separately from that of the WebUI, even with the same session key.
-When `TalkMode` sends transcript text to the main OpenClaw session, the upstream session can include scaffolding such as `<relevant-memories>...</relevant-memories>` in the rendered user message body shown in the tray chat window.
-
-That produced two UX problems:
-
-- the tray chat bubble did not show the clean spoken transcript the user actually said
-- the embedded tray chat window had no draft/update API for showing interim STT hypotheses while the user was still speaking
-
-### Routes Examined
-
-1. Dedicated voice session
- - technically clean from a transport perspective
- - rejected because it fragments the tray chat experience and is confusing for users
-2. Upstream OpenClaw change to suppress memory scaffolding for voice turns
- - desirable long-term if OpenClaw exposes a first-class voice-aware chat surface
- - rejected for the current phase because this Windows tray feature must work without waiting for upstream protocol/UI changes
-3. Tray-local DOM mediation in the embedded chat window
- - chosen
- - keeps a single session and single tray chat history
- - allows interim hypotheses to appear in the tray compose box in near real time
- - does not require the voice transport to depend on WebView DOM submission
+The only vaguely reliable way of achieving this seemed to be to insert messages locally into the DOM, but as this was a brittle, hacky solution, it was discarded.
### Chosen Approach
-The tray app keeps a tray-local interim transcript buffer for the current utterance, independent of whether the chat window is open.
-
-The tray now has two separate local voice surfaces:
+It was therefore decided to create a separate *voice repeater form* to serve as a message window for voice, as well as making the messages available via toasts.
-- [WebChatWindow.xaml.cs](../src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs)
- - mirrors the live transcript draft into the real WebChat compose box when WebChat is open
- - optionally strips `<relevant-memories>...</relevant-memories>` from that mirrored draft text before injection
- - does not own transport
- - does not try to fake typed-chat parity for sent voice turns
-- [VoiceRepeaterWindow.xaml.cs](../src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml.cs)
- - is the compact tray-local voice surface
- - shows live transcript, outgoing sent text, and incoming replies in one scrollable transcript strip
- - exposes compact pause/resume, skip, mic re-arm/reset, and local repeater settings controls
- - can open the older Voice Status window for deeper troubleshooting/status detail
-
-The embedded WebChat surface therefore only owns draft mirroring:
-
-- interim STT hypotheses from Windows speech recognition are injected into the tray chat compose box while the user is speaking
-- if the chat window opens during an utterance, the current buffered transcript is copied into the compose box immediately
-- if the chat window closes during an utterance, voice continues windowless and the final utterance still submits
-- the final utterance always goes through the voice service's direct `chat.send` path
-- the tray chat window remains a draft/view surface rather than a transport dependency
-
-This is intentionally a tray-local integration decision, not a protocol-level rewrite of the stored upstream transcript.
-
-### Tradeoffs
-
-- preserves a single visible conversation for the user
-- avoids a second voice-only session in the tray UI
-- uses only one send path for voice turns, which is simpler to reason about and debug
-- requires us to change the existing project, but not too significantly
-- keeps a light DOM integration inside the embedded WebView chat surface for draft mirroring only
-- gives voice mode a separate compact surface that does not depend on WebChat DOM behavior for basic transcript/reply visibility
-- only affects the tray app chat window; other clients still render upstream content according to their own rules
+The tray app keeps a tray-local interim transcript buffer for the current utterance, independent of whether any chat window or voice repeater form is open.
## Provider Selection
@@ -400,7 +344,7 @@ Example:
For cloud-backed TTS providers, the catalog carries either an HTTP or WebSocket request/response contract. That allows a new provider to be added by shipping an updated catalog file with the app, as long as it follows the same general templated transport approach.
-This file defines provider metadata and transport contracts. It does not carry API keys.
+This file defines provider metadata and transport contracts. It does not carry API keys; these are stored with the standard config.
### Local Provider Configuration
@@ -443,6 +387,9 @@ The current cloud TTS transports are:
For `VoiceWake`, trigger words are gateway-owned global state. The Windows node should eventually consume the same shared trigger list and keep only a local enabled/disabled toggle plus device/runtime settings.
+In-flight voice controls are supported where the chosen provider supports them and they are supplied in that provider's format. An abstraction/translation layer is being considered, to accompany support for OpenClaw voice directives in reply records.
+Pronunciation dictionaries are likewise currently supported only directly on the voice provider; however, a centralised dictionary is possible, and a proposal is being considered.
+
## Command Surface
The voice subsystem is introduced as a new node capability category: `voice`.
@@ -459,7 +406,7 @@ The voice subsystem is introduced as a new node capability category: `voice`.
| `voice.stop` | Stop the voice runtime | `VoiceStopArgs` | `VoiceStatusInfo` |
| `voice.pause` | Pause the active voice runtime | `VoicePauseArgs` | `VoiceStatusInfo` |
| `voice.resume` | Resume a paused voice runtime | `VoiceResumeArgs` | `VoiceStatusInfo` |
-| `voice.skip` | Skip the currently spoken reply and advance the queue if another reply is pending | `VoiceSkipArgs` | `VoiceStatusInfo` |
+| `voice.response.skip` | Skip the currently spoken reply and advance the queue if another reply is pending | `VoiceSkipArgs` | `VoiceStatusInfo` |
### Payload Types
@@ -481,10 +428,26 @@ These contracts are defined in [VoiceModeSchema.cs](../src/OpenClaw.Shared/Voice
Voice settings are persisted as `SettingsData.Voice` in [SettingsData.cs](../src/OpenClaw.Shared/SettingsData.cs).
Provider configuration is persisted as `SettingsData.VoiceProviderConfiguration` in the same local settings file.
+The compact repeater window state is persisted as `SettingsData.VoiceRepeaterWindow` in the same settings file.
The editable voice configuration now lives in the main Settings window.
The tray `Voice Mode` window is a read-only runtime status/detail surface with a shortcut back into Settings.
+### Voice Repeater Window Settings
+
+The compact repeater persists its own local UI state in `SettingsData.VoiceRepeaterWindow`:
+
+| Setting | Type | Default | Meaning |
+|---|---|---|---|
+| `VoiceRepeaterWindow.AutoScroll` | bool | `true` | Automatically scroll the transcript surface to the latest draft/reply |
+| `VoiceRepeaterWindow.FloatingEnabled` | bool | `true` | Keep the repeater floating above other windows |
+| `VoiceRepeaterWindow.TextSize` | double | `13` | Repeater transcript font size |
+| `VoiceRepeaterWindow.HasSavedPlacement` | bool | `false` | Whether a user placement has been persisted yet |
+| `VoiceRepeaterWindow.Width` | int? | `null` | Saved repeater width |
+| `VoiceRepeaterWindow.Height` | int? | `null` | Saved repeater height |
+| `VoiceRepeaterWindow.X` | int? | `null` | Saved repeater screen X coordinate |
+| `VoiceRepeaterWindow.Y` | int? | `null` | Saved repeater screen Y coordinate |
+
### Effective Schema
```json
@@ -492,6 +455,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
"Voice": {
"Mode": "VoiceWake",
"Enabled": true,
+ "ShowRepeaterAtStartup": true,
"SpeechToTextProviderId": "windows",
"TextToSpeechProviderId": "windows",
"InputDeviceId": "default-mic",
@@ -544,6 +508,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
|---|---|
| `Mode` | Top-level activation mode: `Off`, `VoiceWake`, `TalkMode` |
| `Enabled` | Global feature kill-switch independent of mode |
+| `ShowRepeaterAtStartup` | Opens the compact Voice Mode repeater automatically when the app starts with voice mode active |
| `SpeechToTextProviderId` | Selected STT provider id from the local provider catalog |
| `TextToSpeechProviderId` | Selected TTS provider id from the local provider catalog |
| `InputDeviceId` / `OutputDeviceId` | Preferred audio device binding, with selected-speaker support implemented first |
@@ -559,7 +524,7 @@ The tray `Voice Mode` window is a read-only runtime status/detail surface with a
|---|---|---|---|---|
| `Voice.Mode` | enum | `Off` | all | Activation mode: `Off`, `VoiceWake`, `TalkMode` |
| `Voice.Enabled` | bool | `false` | all | Master enable/disable flag for voice mode |
-| `Voice.StripInjectedMemoriesInChat` | bool | `true` | all | If `true`, the tray chat window strips injected `<relevant-memories>` scaffolding from rendered chat text |
+| `Voice.ShowRepeaterAtStartup` | bool | `true` | all | If `true`, the compact Voice Mode repeater opens automatically when the app starts with voice mode active |
| `Voice.SpeechToTextProviderId` | string | `windows` | all | Preferred speech-to-text provider id |
| `Voice.TextToSpeechProviderId` | string | `windows` | all | Preferred text-to-speech provider id |
| `Voice.InputDeviceId` | string? | `null` | all | Preferred microphone device id; `null` means system default |
@@ -609,7 +574,7 @@ The current Windows implementation is still centred on `VoiceService`, with a fe
- `OpenClawGatewayClient`
carries direct `chat.send`, final chat events, and the `sessions.preview` fallback path for bare final markers
- `WebChatWindow`
- mirrors live transcript drafts into the WebChat compose box and can strip injected `<relevant-memories>` blocks from mirrored draft text before injection
+ mirrors live transcript drafts into the WebChat compose box
- `VoiceRepeaterWindow`
is the compact local transcript/reply/control surface for Talk Mode
@@ -632,7 +597,7 @@ flowchart LR
C --> J["ResultGenerated<br/>final Medium/High text"]
J --> K["VoiceService<br/>duplicate guard + late hypothesis promotion"]
K --> L["Stop recognition session"]
- L --> M["OpenClawGatewayClient.SendChatMessageAsync<br/>direct chat.send(main, transcript)"]
+ L --> M["OpenClawGatewayClient.SendChatMessageAsync<br/>direct chat.send(agent:main:main, transcript)"]
M --> N["OpenClaw / session pipeline"]
K --> H2["VoiceChatCoordinator<br/>outgoing turn event"]
H2 --> I2
@@ -671,7 +636,7 @@ flowchart LR
## Planned AudioGraph Input Architecture
-The next input-phase refactor should move microphone ownership away from `SpeechRecognizer` and into an explicit capture pipeline built around `AudioGraph`.
+The next input-phase refactor will move microphone ownership away from `SpeechRecognizer` and into an explicit capture pipeline built around `AudioGraph`.
The purpose of that change is to unlock:
@@ -679,7 +644,7 @@ The purpose of that change is to unlock:
- streaming rather than utterance-owned capture
- a proper ring buffer and VAD pipeline
- future non-Windows and streaming STT providers
-- future barge-in / duplex work
+- future barge-in / full-duplex work
### Target Input Stack
@@ -717,46 +682,13 @@ The target split should look like this:
- `VoiceService`
- remains the runtime orchestrator rather than the owner of low-level capture
-### Proposed STT Adapter Contract
-
-The STT conversion layer should no longer be "whatever `SpeechRecognizer` does internally". It should become an adapter boundary.
-
-Suggested shape:
-
-- `StartAsync(SpeechToTextStartArgs)`
-- `StopAsync()`
-- `PushFramesAsync(ReadOnlyMemory<byte> pcm16, FrameMetadata metadata)` for streaming-capable adapters
-- `SubmitUtteranceAsync(ReadOnlyMemory<byte> pcm16, UtteranceMetadata metadata)` for utterance-based adapters
-- events:
- - `InterimTranscriptReceived`
- - `FinalTranscriptReceived`
- - `RecognitionFaulted`
-
-Likely first adapters:
-
-- `WindowsSpeechToTextAdapter`
- only if Windows gives us a clean explicit-audio-input path
-- `StreamingCloudSpeechToTextAdapter`
- for providers that accept pushed PCM/audio streams
-- `UtteranceCloudSpeechToTextAdapter`
- for providers that still expect bounded utterance uploads
-
## Selected-Device Roadmap
The current selected-device position is now:
- selected non-default speaker: implemented
-- selected/default microphone binding for `AudioGraph` capture: implemented
-- selected non-default microphone for actual transcript generation: not implemented yet
-
-Recommended engineering order:
-
-1. keep the current selected-speaker playback support
-2. extend the live `VoiceCaptureService` path into the STT side
-3. move Talk Mode input from `SpeechRecognizer` ownership to captured PCM frames
-4. introduce `ISpeechToTextAdapter`
-5. complete explicit selected-microphone transcript generation
-6. then revisit duplex/barge-in and streaming STT
+- selected/default microphone binding for `SpeechRecognizer` capture: implemented
+- selected non-default microphone for actual transcript generation: not implemented yet (requires `AudioGraph` support)
## Control Flow
@@ -792,7 +724,7 @@ Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningContinuously)
Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningContinuously)
VoiceCap-->>Gateway: VoiceStatusInfo
- Gateway->>VoiceCap: voice.skip(reason=...)
+ Gateway->>VoiceCap: voice.response.skip(reason=...)
VoiceCap->>Coord: SkipCurrentReply()
Coord-->>VoiceCap: VoiceStatusInfo
VoiceCap-->>Gateway: VoiceStatusInfo
@@ -812,31 +744,19 @@ Coord-->>VoiceCap: VoiceStatusInfo(state=ListeningContinuously)
- `WindowsNodeClient` remains the gateway/node transport
- existing node capability registration remains the integration pattern
- current request/response transport remains the v1 control plane
-- `TalkMode` should reuse existing `chat.send` message flow instead of inventing an audio-upload protocol
-### New Components Expected Later
+### Supporting Components In Current Use
- `VoiceCapability` in `OpenClaw.Shared.Capabilities`
- `VoiceCaptureService` in `OpenClaw.Tray.WinUI.Services`
-- `VoiceWakeService` in `OpenClaw.Tray.WinUI.Services`
- `VoiceChatCoordinator` in `OpenClaw.Tray.WinUI.Services`
-- `VoicePlaybackService` in `OpenClaw.Tray.WinUI.Services`
-
-## Provider Direction
+- `VoiceRepeaterWindow` in `OpenClaw.Tray.WinUI.Windows`
+- `WebChatWindow` in `OpenClaw.Tray.WinUI.Windows`
-Provider support is now part of the Windows voice subsystem roadmap, not a hypothetical extension:
+### Components Still Expected Later
-- `MiniMax` and `ElevenLabs` TTS are both expressed through built-in catalog contracts
-- additional HTTP or WebSocket TTS providers can be added by extending the shipped catalog without recompiling the tray app itself
-- Windows STT remains the active speech-recognition baseline until a non-Windows STT provider is deliberately added
-
-The Windows node still keeps provider choice bounded:
-
-- local tray settings choose the provider ids
-- local tray settings store the provider secrets and editable values for now
-- OpenClaw still owns the conversation/session flow
-
-This keeps the provider surface narrow while still meeting the required MiniMax/ElevenLabs support direction.
+- `VoiceWakeService` in `OpenClaw.Tray.WinUI.Services`
+- a dedicated `VoicePlaybackService` seam when playback is split out of `VoiceService`
## Parity with macOS Node
@@ -849,7 +769,7 @@ Status values used below:
| macOS feature | Current Windows state | Notes |
|---|---|---|
-| Talk Mode continuous loop (`listen -> chat.send(main) -> wait -> speak`) | `Supported` | Windows Talk Mode uses direct `chat.send` on the active main session and loops back to listening after reply playback. |
+| Talk Mode continuous loop (`listen -> chat.send(main) -> wait -> speak`) | `Supported` | Windows Talk Mode uses direct `chat.send` on the tray voice target session (`agent:main:main` today, while still accepting the `main` alias on replies) and loops back to listening after reply playback. |
| Talk Mode sends after a short silence window | `Supported` | The current runtime finalizes on recognition pause and uses configurable Talk Mode silence settings. |
| Talk Mode visible phase transitions (`Listening -> Thinking -> Speaking`) | `Partial` | Runtime states, tray icon changes, and the compact voice repeater window exist, but there is no always-visible overlay yet. |
| Talk Mode always-on overlay with click-to-stop / click-X controls | `NotSupported (planned)` | Windows currently has a tray icon, a manually-opened compact repeater window, and WebChat draft mirroring, but no always-on overlay surface. |
@@ -868,37 +788,42 @@ Status values used below:
| Voice mic device selection | `Partial` | When `Windows Speech Recognition` is selected, Settings now locks both audio device pickers to the system defaults. Explicit per-device transcription routing remains a future AudioGraph/streaming-route feature. |
| Voice Wake send / trigger chimes | `NotSupported (planned)` | Windows currently has no configurable trigger/send sounds. |
-## Feature List (Backlog)
+## Feature List - Backlog - Not in Order, except maybe the first two ;)
-### Story: True selected-microphone transcription support
+### Story: Streaming STT Capture Pipeline
-Make actual STT transcription follow the selected microphone device, not just the `AudioGraph` capture path.
+Use `AudioGraph` to build an extensible streaming speech input pipeline that replaces the current self-contained `Windows.Media.SpeechRecognition.SpeechRecognizer` pipeline.
-Notes:
+This will allow us to mix and match components and reduce latency.
-- current Windows Talk Mode capture can bind to the selected mic through `VoiceCaptureService`
-- final transcript generation still follows the Windows speech-input path rather than the selected device id
-- the implementation should complete the planned `AudioGraph` -> `ISpeechToTextAdapter` migration so the chosen microphone controls the whole input pipeline
+- Will support cloud or local http/ws providers (including Microsoft Foundry Local, OpenAI Whisper, etc.)
+- Will support an embedded sherpa-onnx engine for user-defined/downloaded models
+- This will enable selection of the best-in-class model for the required use or language
-### Story: Generic http/ws streaming STT provider
+### Story: True streaming TTS playback
-Implement the catalog-driven generic HTTP/WebSocket STT adapter shown in Settings as `http/ws (coming soon)`.
+Start speaking assistant replies from the first usable audio chunk instead of waiting for a complete playable stream.
Notes:
-- this route is intended to support a broad class of stand-alone cloud or local streaming models
-- it should remain visible but disabled in Settings until the adapter contract is real and testable
-- it should become the shared base path for provider-specific adapters when a generic contract is sufficient
+- the current implementation uses WebSocket transport for MiniMax, but still buffers the entire audio response before playback begins
+- `firstChunk=...ms` in the log is currently provider-chunk arrival time, not actual speech-start time
+- implement a playback path that can consume incremental audio data as it arrives from the provider
+- the provider catalog contract should remain transport-driven and provider-agnostic, so streaming behavior should be expressed through the existing TTS contract model rather than hard-coded for MiniMax
+- preserve the existing queued reply behavior, skip support, and late-reply handling while switching playback to progressive output
+- add timing logs that separate `firstChunk`, `playbackStart`, and `playbackEnd` so latency improvements are measurable
-### Story: Talk Mode overlay and visible phase parity
+### Story: True selected-microphone transcription support
-Add a Talk Mode overlay that makes `Listening`, `Thinking`, and `Speaking` visible to the user in the same way the macOS experience does.
+Make actual STT transcription follow the selected microphone device, not just the default device.
-Notes:
+- depends on `AudioGraph` support
+
+
+### Story: Talk Mode overlay and visible phase parity
+
+Add a Talk Mode overlay that makes `Listening`, `Thinking`, and `Speaking` visible to the user in the same way the macOS experience does. This would probably build on the current voice mode form. I haven't actually seen the macOS version, so the exact approach is still open.
-- the current tray icon and status window are not equivalent to an always-visible Talk Mode surface
-- the overlay should expose phase transitions clearly and support later stop / dismiss controls
-- this should be designed alongside the existing compact voice-strip idea so the two UI surfaces do not conflict
### Story: Talk Mode overlay controls
@@ -910,15 +835,6 @@ Notes:
- Windows currently requires tray or settings interaction instead
- this should plug into the shared runtime control API rather than directly manipulating `VoiceService`
-### Story: Same-as-typing WebChat parity for Talk Mode
-
-Decide whether Windows Talk Mode should optionally route sent transcripts through a typed-chat equivalent path so WebChat behavior matches manual typing more closely.
-
-Notes:
-
-- current Windows Talk Mode intentionally uses direct `chat.send`
-- replies still appear in WebChat through session updates, but the send path is not literally the same as typed WebChat submission
-- this should only be revisited if the memory / prompt-shaping issues can be fixed without reintroducing transport fragility
### Story: Voice directives in replies
@@ -990,7 +906,6 @@ Allow the node to keep listening while it is speaking, so the user can interrupt
Notes:
- the current Windows implementation is half-duplex: recognition is stopped or ignored while a reply is being spoken
-- a true implementation will require a lower-level audio pipeline rather than only `SpeechRecognizer` plus `SpeechSynthesizer`
- practical requirements are likely to include:
- microphone capture that can remain active during playback
- acoustic echo cancellation / echo suppression
@@ -1019,16 +934,6 @@ Notes:
- it should pause the wake runtime while push-to-talk capture is active, then resume it cleanly afterward
- Windows-specific hotkey and permissions behavior should be documented explicitly once chosen
-### Story: Voice Wake overlay lifecycle
-
-Add the Windows Voice Wake overlay and harden its lifecycle so dismissing or hiding the UI can never leave the wake runtime dead.
-
-Notes:
-
-- support committed and volatile transcript presentation
-- manual dismiss must never block recognizer restart
-- overlay and runtime should be coordinated through a session controller rather than direct UI coupling
-
### Story: Voice Wake settings parity
Add the user-facing Voice Wake settings surface that exists on macOS.
@@ -1081,99 +986,3 @@ Notes:
- include a hard maximum capture duration
- expose the tuning through voice settings rather than hard-coded constants alone
-
-### Story: Compact Voice Status Strip
-
-Add an optional tiny always-on-top voice strip window for Talk Mode.
-
-Notes:
-
-- user-configurable show / hide
-- intended to be a minimal one-line-high display with a small amount of padding
-- should show:
- - current voice state
- - rolling live transcript while listening
- - rolling assistant text while speaking
- - a skip / cut-off control while speaking
-- the runtime control surface should drive this window rather than the window manipulating `VoiceService` internals directly
-- if implemented later, the strip should use the shared runtime control API described elsewhere in this document.
-
-### Story: True streaming TTS playback
-
-Start speaking assistant replies from the first usable audio chunk instead of waiting for a complete playable stream.
-
-Notes:
-
-- the current implementation uses WebSocket transport for MiniMax, but still buffers the entire audio response before playback begins
-- `firstChunk=...ms` in the log is currently provider-chunk arrival time, not actual speech-start time
-- implement a playback path that can consume incremental audio data as it arrives from the provider
-- the provider catalog contract should remain transport-driven and provider-agnostic, so streaming behavior should be expressed through the existing TTS contract model rather than hard-coded for MiniMax
-- preserve the existing queued reply behavior, skip support, and late-reply handling while switching playback to progressive output
-- add timing logs that separate `firstChunk`, `playbackStart`, and `playbackEnd` so latency improvements are measurable
-
-## Commit Timeline
-
-Append one new line to this timeline for every future voice-mode commit.
-
-- `2026-03-21` `be624fe` Added the initial Windows voice-mode foundation and the first AlwaysOn runtime.
-- `2026-03-23` `f40ffc3` Fixed voice chat transport and reply routing.
-- `2026-03-23` `a81d31e` Added configurable voice settings and the setup UI.
-- `2026-03-23` `197a89b` Integrated always-on voice mode with the tray chat workflow.
-- `2026-03-23` `1340bde` Fixed tray voice startup and chat window submission.
-- `2026-03-23` `aed8cb8` Removed the stale always-on autosubmit setting.
-- `2026-03-23` `25dd06b` Added focused coordinator coverage for tray voice chat.
-- `2026-03-23` `1336472` Addressed review findings and hardened the voice runtime.
-- `2026-03-23` `0f1028a` Documented MiniMax and ElevenLabs as required provider support.
-- `2026-03-23` `2c8a46d` Hardened tray chat voice message handling.
-- `2026-03-23` `fdbf48e` Fixed voice transport connection task reuse.
-- `2026-03-23` `b556c64` Grouped voice runtime services under `Services/Voice`.
-- `2026-03-23` `7f31c12` Implemented MiniMax TTS for voice mode.
-- `2026-03-23` `c64f168` Added editable TTS provider settings to voice mode.
-- `2026-03-23` `907a1a0` Moved voice settings into the main settings window.
-- `2026-03-23` `6dba89b` Extracted the hosted voice settings panel from the settings window.
-- `2026-03-23` `ded41a2` Generalized cloud TTS providers through catalog contracts.
-- `2026-03-23` `199e534` Renamed voice modes to `VoiceWake` and `TalkMode`.
-- `2026-03-23` `47efc3e` Moved voice settings below the node mode toggle.
-- `2026-03-23` `85d7b90` Made cloud TTS voice settings fully catalog-driven.
-- `2026-03-23` `c1cc0ff` Shipped the voice provider catalog with the tray app.
-- `2026-03-23` `83f05ee` Instrumented voice output latency and reduced TTS buffering.
-- `2026-03-23` `d137409` Tightened talk-mode speech-recognition filtering.
-- `2026-03-23` `05d7bae` Switched MiniMax TTS to the `api-uw` endpoint.
-- `2026-03-23` `5efcebf` Added catalog-driven MiniMax WebSocket TTS.
-- `2026-03-23` `45ff8f8` Fixed voice restart after settings save.
-- `2026-03-23` `71d0de4` Fixed MiniMax WebSocket voice playback routing.
-- `2026-03-23` `91ccec3` Added dynamic tray icons for voice states.
-- `2026-03-23` `2ff57fc` Added pre-response voice latency timing logs.
-- `2026-03-23` `ffa3fa2` Kept Talk Mode alive after input failures.
-- `2026-03-25` `c3ded30` Queued Talk Mode replies for sequential playback.
-- `2026-03-25` `82e2958` Added voice control and configuration APIs.
-- `2026-03-25` `06d508f` Accepted late Talk Mode replies after timeout.
-- `2026-03-25` Delayed the Talk Mode ready state until recognizer warm-up completes, so the UI does not advertise listening before the first recognition session has settled.
-- `2026-03-25` Added a recognizer health-check watchdog so Talk Mode recycles a started-but-deaf recognition session instead of waiting minutes for Windows to cancel it.
-- `2026-03-25` Reverted Talk Mode to a single direct `chat.send` path and reduced the tray chat integration back to draft mirroring only.
-- `2026-03-25` Added a configurable tray-chat display filter for injected `<relevant-memories>` blocks.
-- `2026-03-25` Fixed the recognizer watchdog so a stalled Talk Mode session is actually canceled and restarted instead of logging a recycle and then remaining deaf.
-- `2026-03-25` Rebuilt the Windows speech recognizer after repeated deaf `UserCanceled` and watchdog-recycle failures instead of repeatedly restarting the same broken recognizer instance.
-- `2026-03-25` Fixed the tray-chat draft mirror so it clears immediately after direct send, and primed media playback before `Play()` so spoken replies stop clipping their opening syllables.
-- `2026-03-25` Added a backlog story for true streaming TTS playback, including provider-catalog and latency-measurement notes.
-- `2026-03-25` Corrected the MiniMax WebSocket request sequence by sending `task_finish` before reading audio, and added a guarded fallback that promotes a recent longer hypothesis when Windows only finalizes the tail of an utterance.
-- `2026-03-25` Added live default-microphone change handling for Talk Mode, so using the system default capture device now refreshes the recognizer when Windows switches to a new default mic such as AirPods.
-- `2026-03-25` Applied `Voice.OutputDeviceId` to Talk Mode playback via `MediaPlayer.AudioDevice`, so selected non-default speaker devices now work even though explicit non-default microphone capture is still pending.
-- `2026-03-25` Hardened the gateway preview fallback so a bare final chat event does not replay the previous assistant reply when `sessions.preview` lags behind the real session update.
-- `2026-03-25` Updated the voice-mode architecture document with an accurate current Talk Mode flow, the planned `AudioGraph` input design, the STT adapter seam, and the selected-device roadmap.
-- `2026-03-25` Added `VoiceCaptureService` on `AudioGraph`, wired it into Talk Mode lifecycle, and started using live capture signal as part of recognizer health and device-refresh handling while transcript generation still remains on the Windows speech recognizer.
-- `2026-03-25` Fixed `VoiceCaptureService` frame-buffer interop for the current WinRT projection so AudioGraph quantum processing can read frame data instead of throwing invalid-cast warnings on every capture quantum.
-- `2026-03-25` Changed Talk Mode readiness so `Listening` only appears after the AudioGraph capture path is delivering frames and the recognizer warm-up completes, making it a user-facing “start talking now” signal rather than a timer-only state.
-- `2026-03-25` Changed recognizer recovery so silence no longer churns the Talk Mode pipeline; recycling now only happens when sustained capture speech is present but the Windows recognizer still produces no activity.
-- `2026-03-25` Delayed deaf-recognizer recovery until post-speech silence and added a completion fallback that submits the last recent hypothesis when Windows ends a session without ever producing a final result.
-- `2026-03-25` Fixed overlapping Talk Mode recovery watchdogs so a new recognition session no longer launches duplicate deaf-recognizer recycle loops.
-- `2026-03-25` Fixed Talk Mode media playback failure handling so a failed reply no longer leaks an unobserved task exception after the reply audio arrives.
-- `2026-03-25` Generalized the catalog-driven WebSocket TTS client to support providers without explicit connect/start acknowledgements and switched ElevenLabs to the `stream-input` WebSocket API with default voice settings.
-- `2026-03-25` Reworked the dynamic tray icon language so listening shows activity waves around the headphones and speaking uses a microphone badge instead of a speaker icon.
-- `2026-03-25` Adjusted the ElevenLabs `stream-input` contract to send `xi_api_key` in the init message and `flush: true` with the text turn so short replies are emitted promptly instead of being buffered then closed.
-- `2026-03-25` Stopped sending an immediate ElevenLabs EOS message before reading audio, so single-turn `stream-input` replies rely on `flush: true` instead of prematurely closing the socket.
-- `2026-03-25` Switched ElevenLabs `stream-input` to the published text-generation pattern: send the text turn with a trailing-space variant plus `try_trigger_generation: true`, then send EOS, and include the WebSocket close status in playback errors.
-- `2026-03-26` Merged the latest `feature/voice-mode` updates and added explicit recognizer completion decision logging so Talk Mode now records why a stopped Windows speech session did or did not restart/rebuild after idle or watchdog completions.
-- `2026-03-26` Cleared the controlled-restart latch as soon as a new recognition session successfully starts, and on failed resume attempts, so Talk Mode does not get stranded in `controlled-restart-in-progress` after an idle/deaf recognizer recycle.
-- `2026-03-26` Split Talk Mode STT startup into explicit route classes: the built-in `windows` provider now uses a pure `Windows.Media` path with no AudioGraph, while `foundry-local` and `sherpa-onnx` are separated into dedicated future AudioGraph/embedded route classes instead of sharing the Windows pipeline.
-- `2026-03-26` Updated Talk Mode to mirror sent voice turns back into the tray chat window, clear stale low-confidence drafts, locally reprompt when speech activity ends without a usable transcript, surface provider and mode descriptions in Settings, expose `http/ws` and `sherpa-onnx` as coming-soon STT catalog entries, and lock device selection to system defaults when Windows Speech Recognition is selected.
diff --git a/src/OpenClaw.Shared/SettingsData.cs b/src/OpenClaw.Shared/SettingsData.cs
index 4dbd06b..56d8e4c 100644
--- a/src/OpenClaw.Shared/SettingsData.cs
+++ b/src/OpenClaw.Shared/SettingsData.cs
@@ -70,6 +70,8 @@ private static string MigrateLegacyVoiceJson(string json)
public sealed class VoiceRepeaterWindowSettings
{
public bool AutoScroll { get; set; } = true;
+ public bool FloatingEnabled { get; set; } = true;
+ public bool HasSavedPlacement { get; set; }
public double TextSize { get; set; } = 13;
public int? Width { get; set; }
public int? Height { get; set; }
diff --git a/src/OpenClaw.Shared/VoiceModeSchema.cs b/src/OpenClaw.Shared/VoiceModeSchema.cs
index 6ba5202..e47af8c 100644
--- a/src/OpenClaw.Shared/VoiceModeSchema.cs
+++ b/src/OpenClaw.Shared/VoiceModeSchema.cs
@@ -15,7 +15,7 @@ public static class VoiceCommands
public const string Stop = "voice.stop";
public const string Pause = "voice.pause";
public const string Resume = "voice.resume";
- public const string Skip = "voice.skip";
+ public const string Skip = "voice.response.skip";
private static readonly ReadOnlyCollection<string> s_all = Array.AsReadOnly(
[
@@ -61,8 +61,8 @@ public sealed class VoiceSettings
{
public VoiceActivationMode Mode { get; set; } = VoiceActivationMode.Off;
public bool Enabled { get; set; }
+ public bool ShowRepeaterAtStartup { get; set; } = true;
public bool ShowConversationToasts { get; set; }
- public bool StripInjectedMemoriesInChat { get; set; } = true;
public string SpeechToTextProviderId { get; set; } = VoiceProviderIds.Windows;
public string TextToSpeechProviderId { get; set; } = VoiceProviderIds.Windows;
public string? InputDeviceId { get; set; }
diff --git a/src/OpenClaw.Tray.WinUI/App.xaml.cs b/src/OpenClaw.Tray.WinUI/App.xaml.cs
index b5cd0cd..5b7fcc8 100644
--- a/src/OpenClaw.Tray.WinUI/App.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/App.xaml.cs
@@ -311,6 +311,11 @@ protected override async void OnLaunched(LaunchActivatedEventArgs args)
HandleDeepLink(startupDeepLink);
}
+ if (ShouldShowVoiceRepeaterAtStartup())
+ {
+ _dispatcherQueue?.TryEnqueue(ShowVoiceModeSettings);
+ }
+
Logger.Info("Application started (WinUI 3)");
}
@@ -783,6 +788,14 @@ private bool CanQuickToggleVoiceMode()
return _settings.Voice.Enabled && _settings.Voice.Mode != VoiceActivationMode.Off;
}
+ private bool ShouldShowVoiceRepeaterAtStartup()
+ {
+ return _settings?.EnableNodeMode == true &&
+ _settings.Voice.Enabled &&
+ _settings.Voice.Mode != VoiceActivationMode.Off &&
+ _settings.Voice.ShowRepeaterAtStartup;
+ }
+
private string GetVoiceQuickToggleLabel()
{
var status = _voiceService?.CurrentStatus;
@@ -1864,18 +1877,6 @@ private async void OnSettingsSaved(object? sender, EventArgs e)
_globalHotkey?.Unregister();
}
- if (_webChatWindow != null && !_webChatWindow.IsClosed)
- {
- try
- {
- await _webChatWindow.SetStripInjectedMemoriesEnabledAsync(_settings.Voice.StripInjectedMemoriesInChat);
- }
- catch (Exception ex)
- {
- Logger.Warn($"Failed to refresh tray chat cleanup setting: {ex.Message}");
- }
- }
-
_voiceRepeaterWindow?.RefreshStatus();
_voiceModeWindow?.RefreshStatus();
@@ -1889,8 +1890,7 @@ private void ShowWebChat()
{
_webChatWindow = new WebChatWindow(
_settings!.GatewayUrl,
- _settings.Token,
- _settings.Voice.StripInjectedMemoriesInChat);
+ _settings.Token);
_webChatWindow.Closed += (s, e) =>
{
_voiceChatCoordinator?.DetachWindow(_webChatWindow);
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
index fa6721f..cadffd4 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml
@@ -32,6 +32,9 @@
TextWrapping="Wrap"/>
</Grid>
+ <CheckBox x:Name="VoiceShowRepeaterAtStartupCheckBox"
+ Content="Show VoiceMode repeater form at startup"/>
+
<Grid ColumnSpacing="12">
<Grid.ColumnDefinitions>
<ColumnDefinition Width="2*"/>
@@ -99,8 +102,6 @@
<CheckBox x:Name="VoiceConversationToastsCheckBox"
Content="Show voice transcripts and replies as toasts"/>
- <CheckBox x:Name="VoiceStripInjectedMemoriesCheckBox"
- Content="Hide injected &lt;relevant-memories&gt; blocks in the tray chat window"/>
<TextBlock x:Name="VoiceSettingsInfoTextBlock"
Style="{StaticResource CaptionTextBlockStyle}"
diff --git a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
index 73c1db1..30290a4 100644
--- a/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Controls/VoiceSettingsPanel.xaml.cs
@@ -46,8 +46,8 @@ public async Task ApplyAsync(SettingsManager settings)
{
Mode = GetSelectedVoiceMode(),
Enabled = GetSelectedVoiceMode() != VoiceActivationMode.Off,
+ ShowRepeaterAtStartup = (VoiceShowRepeaterAtStartupCheckBox.IsChecked ?? true) && GetSelectedVoiceMode() != VoiceActivationMode.Off,
ShowConversationToasts = VoiceConversationToastsCheckBox.IsChecked ?? false,
- StripInjectedMemoriesInChat = VoiceStripInjectedMemoriesCheckBox.IsChecked ?? true,
SpeechToTextProviderId = (VoiceSpeechToTextProviderComboBox.SelectedItem as VoiceProviderOption)?.Id ?? VoiceProviderIds.Windows,
TextToSpeechProviderId = (VoiceTextToSpeechProviderComboBox.SelectedItem as VoiceProviderOption)?.Id ?? VoiceProviderIds.Windows,
InputDeviceId = (VoiceInputDeviceComboBox.SelectedItem as DeviceOption)?.DeviceId,
@@ -96,8 +96,10 @@ private void LoadVoiceSettings()
LoadVoiceProviders();
SelectVoiceMode(_settings.Voice.Mode);
UpdateVoiceSelectionDescriptions();
+ VoiceShowRepeaterAtStartupCheckBox.IsChecked = _settings.Voice.Mode == VoiceActivationMode.Off
+ ? false
+ : _settings.Voice.ShowRepeaterAtStartup;
VoiceConversationToastsCheckBox.IsChecked = _settings.Voice.ShowConversationToasts;
- VoiceStripInjectedMemoriesCheckBox.IsChecked = _settings.Voice.StripInjectedMemoriesInChat;
UpdateVoiceProviderSettingsEditor();
UpdateVoiceSettingsInfo();
}
@@ -244,7 +246,7 @@ private void UpdateVoiceSettingsInfo()
}
VoiceSettingsInfoTextBlock.Text =
- $"Mode: {VoiceDisplayHelper.GetModeLabel(GetSelectedVoiceMode())}. STT: {stt}. TTS: {tts}. Listen: {input}. Talk: {output}. Chat cleanup: {(VoiceStripInjectedMemoriesCheckBox.IsChecked ?? true ? "on" : "off")}.{fallbackNotice}";
+ $"Mode: {VoiceDisplayHelper.GetModeLabel(GetSelectedVoiceMode())}. STT: {stt}. TTS: {tts}. Listen: {input}. Talk: {output}.{fallbackNotice}";
}
private void UpdateDeviceSelectionAvailability()
@@ -391,6 +393,11 @@ private async void OnRefreshVoiceDevices(object sender, RoutedEventArgs e)
private void OnVoiceModeChanged(object sender, SelectionChangedEventArgs e)
{
+ var mode = GetSelectedVoiceMode();
+ VoiceShowRepeaterAtStartupCheckBox.IsChecked = mode == VoiceActivationMode.Off
+ ? false
+ : (VoiceShowRepeaterAtStartupCheckBox.IsChecked ?? true);
+ VoiceShowRepeaterAtStartupCheckBox.IsEnabled = mode != VoiceActivationMode.Off;
UpdateVoiceSelectionDescriptions();
UpdateVoiceSettingsInfo();
}
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
index 1f85f5e..50bca27 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceService.cs
@@ -25,13 +25,14 @@ namespace OpenClawTray.Services.Voice;
public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoiceRuntimeControlApi, IDisposable
{
- private const string DefaultSessionKey = "main";
+ private const string DefaultSessionKey = "agent:main:main";
private const int HResultSpeechPrivacyDeclined = unchecked((int)0x80045509);
private static readonly TimeSpan TransportConnectTimeout = TimeSpan.FromSeconds(10);
private static readonly TimeSpan ReplyTimeout = TimeSpan.FromSeconds(45);
private static readonly TimeSpan LateReplyGraceWindow = TimeSpan.FromMinutes(2);
private static readonly TimeSpan InitialRecognitionReadyDelay = TimeSpan.FromMilliseconds(500);
private static readonly TimeSpan DuplicateTranscriptWindow = TimeSpan.FromMilliseconds(750);
+ private static readonly TimeSpan DuplicateAssistantReplyWindow = TimeSpan.FromSeconds(5);
private static readonly TimeSpan HypothesisPromotionWindow = TimeSpan.FromSeconds(2);
private static readonly TimeSpan RecognitionResumeRetryDelay = TimeSpan.FromMilliseconds(500);
private static readonly TimeSpan QueuedReplyPlaybackGap = TimeSpan.FromMilliseconds(500);
@@ -71,6 +72,9 @@ public sealed class VoiceService : IVoiceRuntime, IVoiceConfigurationApi, IVoice
private string? _currentReplyPreview;
private string? _lateReplySessionKey;
private DateTime? _lateReplyGraceUntilUtc;
+ private string? _lastAcceptedAssistantReplyText;
+ private string? _lastAcceptedAssistantReplySessionKey;
+ private DateTime _lastAcceptedAssistantReplyUtc;
private bool _disposed;
public event EventHandler<VoiceConversationTurnEventArgs>? ConversationTurnAvailable;
@@ -659,18 +663,19 @@ private async Task EnsureChatTransportAsync(CancellationToken cancellationToken)
out shouldStartConnection);
_transportReadyTcs = readySource;
- if (shouldStartConnection)
- {
- _chatTransportStatus = ConnectionStatus.Connecting;
-
- if (existingClient == null)
+ if (shouldStartConnection)
{
- _chatClient = new OpenClawGatewayClient(_settings.GatewayUrl, _settings.Token, _logger);
- _chatClient.StatusChanged += OnChatTransportStatusChanged;
- _chatClient.ChatMessageReceived += OnChatMessageReceived;
- existingClient = _chatClient;
+ _chatTransportStatus = ConnectionStatus.Connecting;
+
+ if (existingClient == null)
+ {
+ _chatClient = new OpenClawGatewayClient(_settings.GatewayUrl, _settings.Token, _logger);
+ _chatClient.StatusChanged += OnChatTransportStatusChanged;
+ _chatClient.ChatMessageReceived += OnChatMessageReceived;
+ _chatClient.SessionPreviewUpdated += OnSessionPreviewUpdated;
+ existingClient = _chatClient;
+ }
}
- }
}
if (shouldStartConnection)
@@ -1176,10 +1181,24 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
return;
}
- string text;
- bool shouldStartPlaybackLoop = false;
- bool acceptedViaLateReplyGrace = false;
+ await AcceptAssistantReplyAsync(args.SessionKey, args.Message, "chat event");
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice chat message handler failed: {ex.Message}");
+ }
+ }
+
+ private async void OnSessionPreviewUpdated(object? sender, SessionsPreviewPayloadInfo payload)
+ {
+ try
+ {
+ if (payload.Previews == null || payload.Previews.Count == 0)
+ {
+ return;
+ }
+ string? expectedSessionKey;
lock (_gate)
{
if (!_status.Running || _status.Mode != VoiceActivationMode.TalkMode)
@@ -1187,67 +1206,117 @@ private async void OnChatMessageReceived(object? sender, ChatMessageEventArgs ar
return;
}
- if (!IsMatchingSessionKey(args.SessionKey, GetCurrentVoiceSessionKey()))
+ expectedSessionKey = GetCurrentVoiceSessionKey();
+ }
+
+ foreach (var preview in payload.Previews)
+ {
+ if (!IsMatchingSessionKey(preview.Key, expectedSessionKey))
{
- return;
+ continue;
}
- acceptedViaLateReplyGrace = ShouldAcceptLateAssistantReply(
- _awaitingReply,
- _isSpeaking,
- _pendingAssistantReplies.Count,
- _lateReplySessionKey,
- _lateReplyGraceUntilUtc,
- args.SessionKey,
- DateTime.UtcNow);
+ var assistantText = preview.Items
+ .LastOrDefault(item =>
+ string.Equals(item.Role, "assistant", StringComparison.OrdinalIgnoreCase) &&
+ !string.IsNullOrWhiteSpace(item.Text))
+ ?.Text;
- if (!ShouldAcceptAssistantReply(_awaitingReply, _isSpeaking, _pendingAssistantReplies.Count, acceptedViaLateReplyGrace))
+ if (string.IsNullOrWhiteSpace(assistantText))
{
- return;
+ continue;
}
- _awaitingReply = false;
- if (acceptedViaLateReplyGrace)
- {
- _lateReplySessionKey = null;
- _lateReplyGraceUntilUtc = null;
- }
- text = PrepareReplyForSpeech(args.Message);
+ await AcceptAssistantReplyAsync(preview.Key, assistantText, "session preview");
+ return;
+ }
+ }
+ catch (Exception ex)
+ {
+ _logger.Warn($"Voice session preview handler failed: {ex.Message}");
+ }
+ }
+
+ private async Task AcceptAssistantReplyAsync(string? sessionKey, string? rawText, string source)
+ {
+ if (string.IsNullOrWhiteSpace(rawText))
+ {
+ return;
+ }
+
+ string text;
+ bool acceptedViaLateReplyGrace;
+ bool shouldResumeRecognition = false;
+ bool shouldStartPlaybackLoop = false;
+ var utcNow = DateTime.UtcNow;
+
+ lock (_gate)
+ {
+ if (!_status.Running || _status.Mode != VoiceActivationMode.TalkMode)
+ {
+ return;
+ }
+
+ if (!IsMatchingSessionKey(sessionKey, GetCurrentVoiceSessionKey()))
+ {
+ return;
}
+ acceptedViaLateReplyGrace = ShouldAcceptLateAssistantReply(
+ _awaitingReply,
+ _isSpeaking,
+ _pendingAssistantReplies.Count,
+ _lateReplySessionKey,
+ _lateReplyGraceUntilUtc,
+ sessionKey,
+ utcNow);
+
+ if (!ShouldAcceptAssistantReply(_awaitingReply, _isSpeaking, _pendingAssistantReplies.Count, acceptedViaLateReplyGrace))
+ {
+ return;
+ }
+
+ text = PrepareReplyForSpeech(rawText);
+ if (ShouldSuppressDuplicateAssistantReply(sessionKey, text, utcNow))
+ {
+ return;
+ }
+
+ _awaitingReply = false;
if (acceptedViaLateReplyGrace)
{
- _logger.Warn($"Voice accepted late assistant reply after timeout for session {args.SessionKey}");
+ _lateReplySessionKey = null;
+ _lateReplyGraceUntilUtc = null;
}
+ RememberAcceptedAssistantReply(sessionKey, text, utcNow);
+
if (string.IsNullOrWhiteSpace(text))
{
- var shouldResumeRecognition = false;
- lock (_gate)
+ if (_status.Running && !_replyPlaybackLoopActive)
{
- if (_status.Running && !_replyPlaybackLoopActive)
- {
- _status = BuildRunningStatus(
- VoiceActivationMode.TalkMode,
- _status.SessionKey,
- VoiceRuntimeState.Arming,
- _status.LastError);
- shouldResumeRecognition = true;
- }
- }
-
- if (shouldResumeRecognition)
- {
- await ResumeRecognitionSessionAsync(CancellationToken.None, "empty assistant reply");
+ _status = BuildRunningStatus(
+ VoiceActivationMode.TalkMode,
+ _status.SessionKey,
+ VoiceRuntimeState.Arming,
+ _status.LastError);
+ shouldResumeRecognition = true;
}
- return;
}
+ else
+ {
+ QueueAssistantReplyForPlayback(text, sessionKey, out shouldStartPlaybackLoop);
+ }
+ }
- QueueAssistantReplyForPlayback(text, args.SessionKey, out shouldStartPlaybackLoop);
+ if (acceptedViaLateReplyGrace)
+ {
+ _logger.Warn($"Voice accepted late assistant reply after timeout for session {sessionKey} via {source}");
}
- catch (Exception ex)
+
+ if (string.IsNullOrWhiteSpace(text) && shouldResumeRecognition)
{
- _logger.Warn($"Voice chat message handler failed: {ex.Message}");
+ await ResumeRecognitionSessionAsync(CancellationToken.None, "empty assistant reply");
}
}
@@ -1465,6 +1534,10 @@ private async Task WarmSpeechPlaybackPipelineAsync(
catch (OperationCanceledException)
{
}
+ catch (ObjectDisposedException)
+ {
+ _logger.Info("Voice playback warm-up skipped because playback resources were disposed during initialization.");
+ }
catch (Exception ex)
{
_logger.Warn($"Voice playback warm-up failed: {ex.Message}");
@@ -1499,7 +1572,7 @@ internal static bool ShouldAcceptLateAssistantReply(
queuedReplyCount == 0 &&
!string.IsNullOrWhiteSpace(lateReplySessionKey) &&
!string.IsNullOrWhiteSpace(incomingSessionKey) &&
- string.Equals(lateReplySessionKey, incomingSessionKey, StringComparison.OrdinalIgnoreCase) &&
+ IsMatchingSessionKey(incomingSessionKey, lateReplySessionKey) &&
lateReplyGraceUntilUtc.HasValue &&
utcNow <= lateReplyGraceUntilUtc.Value;
}
@@ -1769,9 +1842,9 @@ private static async Task PreloadStreamAsync(
}
finally
{
- player.MediaOpened -= openedHandler;
- player.MediaFailed -= failedHandler;
- player.Source = null;
+ try { player.MediaOpened -= openedHandler; } catch { }
+ try { player.MediaFailed -= failedHandler; } catch { }
+ try { player.Source = null; } catch { }
}
}
@@ -1836,10 +1909,10 @@ private static async Task PlayStreamAsync(
}
finally
{
- player.MediaOpened -= openedHandler;
- player.MediaEnded -= endedHandler;
- player.MediaFailed -= failedHandler;
- player.Source = null;
+ try { player.MediaOpened -= openedHandler; } catch { }
+ try { player.MediaEnded -= endedHandler; } catch { }
+ try { player.MediaFailed -= failedHandler; } catch { }
+ try { player.Source = null; } catch { }
}
}
@@ -2052,6 +2125,9 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
_currentReplyPreview = null;
_lateReplySessionKey = null;
_lateReplyGraceUntilUtc = null;
+ _lastAcceptedAssistantReplyText = null;
+ _lastAcceptedAssistantReplySessionKey = null;
+ _lastAcceptedAssistantReplyUtc = default;
playbackSkipCts = _playbackSkipCts;
_playbackSkipCts = null;
}
@@ -2088,6 +2164,7 @@ private async Task StopRuntimeResourcesAsync(bool updateStoppedStatus)
{
chatClient.StatusChanged -= OnChatTransportStatusChanged;
chatClient.ChatMessageReceived -= OnChatMessageReceived;
+ chatClient.SessionPreviewUpdated -= OnSessionPreviewUpdated;
try { await chatClient.DisconnectAsync(); } catch { }
try { chatClient.Dispose(); } catch { }
}
@@ -2126,7 +2203,28 @@ private static bool IsMatchingSessionKey(string? actualSessionKey, string? expec
private static bool IsMainSessionKey(string sessionKey)
{
- return sessionKey == DefaultSessionKey || sessionKey.Contains(":main:", StringComparison.Ordinal);
+ return string.Equals(sessionKey, "main", StringComparison.OrdinalIgnoreCase) ||
+ string.Equals(sessionKey, DefaultSessionKey, StringComparison.OrdinalIgnoreCase) ||
+ sessionKey.Contains(":main:", StringComparison.OrdinalIgnoreCase);
+ }
+
+ private bool ShouldSuppressDuplicateAssistantReply(string? sessionKey, string text, DateTime utcNow)
+ {
+ if (string.IsNullOrWhiteSpace(text))
+ {
+ return false;
+ }
+
+ return string.Equals(_lastAcceptedAssistantReplySessionKey, string.IsNullOrWhiteSpace(sessionKey) ? DefaultSessionKey : sessionKey, StringComparison.OrdinalIgnoreCase) &&
+ string.Equals(_lastAcceptedAssistantReplyText, text, StringComparison.Ordinal) &&
+ utcNow - _lastAcceptedAssistantReplyUtc <= DuplicateAssistantReplyWindow;
+ }
+
+ private void RememberAcceptedAssistantReply(string? sessionKey, string text, DateTime utcNow)
+ {
+ _lastAcceptedAssistantReplySessionKey = string.IsNullOrWhiteSpace(sessionKey) ? DefaultSessionKey : sessionKey;
+ _lastAcceptedAssistantReplyText = string.IsNullOrWhiteSpace(text) ? null : text;
+ _lastAcceptedAssistantReplyUtc = utcNow;
}
internal static bool ShouldRefreshRecognitionForDefaultCaptureDeviceChange(
@@ -2428,8 +2526,8 @@ private static VoiceSettings Clone(VoiceSettings source)
{
Mode = source.Mode,
Enabled = source.Enabled,
+ ShowRepeaterAtStartup = source.ShowRepeaterAtStartup,
ShowConversationToasts = source.ShowConversationToasts,
- StripInjectedMemoriesInChat = source.StripInjectedMemoriesInChat,
SpeechToTextProviderId = source.SpeechToTextProviderId,
TextToSpeechProviderId = source.TextToSpeechProviderId,
InputDeviceId = source.InputDeviceId,
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml b/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml
index 7714ead..c34f030 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml
@@ -98,26 +98,29 @@
Orientation="Horizontal"
Spacing="4">
<Button x:Name="PauseResumeButton"
- Width="30"
- Height="28"
+ Width="28"
+ Height="26"
+ Padding="0"
Click="OnPauseResumeClick"
ToolTipService.ToolTip="Pause or resume voice mode">
- <FontIcon x:Name="PauseResumeIcon" Glyph="&#xE769;" FontSize="11"/>
+ <SymbolIcon x:Name="PauseResumeIcon" Symbol="Pause"/>
</Button>
<Button x:Name="SkipReplyButton"
- Width="30"
- Height="28"
+ Width="28"
+ Height="26"
+ Padding="0"
Click="OnSkipReplyClick"
ToolTipService.ToolTip="Skip current reply">
- <FontIcon Glyph="&#xE72A;" FontSize="11"/>
+ <SymbolIcon Symbol="Forward"/>
</Button>
<Button x:Name="ViewSettingsButton"
- Width="30"
- Height="28"
+ Width="28"
+ Height="26"
+ Padding="0"
ToolTipService.ToolTip="Voice repeater settings">
- <FontIcon Glyph="&#xE713;" FontSize="11"/>
+ <SymbolIcon Symbol="Setting"/>
<Button.Flyout>
<Flyout Placement="TopEdgeAlignedRight">
<StackPanel Width="220" Spacing="8" Padding="8">
@@ -126,6 +129,11 @@
Checked="OnAutoScrollChanged"
Unchecked="OnAutoScrollChanged"/>
+ <CheckBox x:Name="FloatingEnabledCheckBox"
+ Content="Float above other windows"
+ Checked="OnFloatingEnabledChanged"
+ Unchecked="OnFloatingEnabledChanged"/>
+
<StackPanel Spacing="4">
<TextBlock Text="Text size"
Foreground="{ThemeResource TextFillColorSecondaryBrush}"/>
diff --git a/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml.cs
index 6ff8a31..017a9cf 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/VoiceRepeaterWindow.xaml.cs
@@ -21,6 +21,7 @@ public sealed partial class VoiceRepeaterWindow : WindowEx, IVoiceChatWindow
private const int MaxConversationItems = 24;
private const int DefaultWidth = 360;
private const int DefaultHeight = 170;
+ private const int DefaultMargin = 12;
private const double DefaultTextSize = 13;
private const double DefaultCaptionSize = 10;
@@ -32,6 +33,9 @@ public sealed partial class VoiceRepeaterWindow : WindowEx, IVoiceChatWindow
private bool _controlActionInFlight;
private bool _suppressSettingsEvents;
+ private bool _suppressPlacementSave = true;
+ private bool _initialPlacementPending = true;
+ private bool _placementDirty;
private bool _autoScrollEnabled;
private double _messageFontSize = DefaultTextSize;
private double _captionFontSize = DefaultCaptionSize;
@@ -57,6 +61,7 @@ public VoiceRepeaterWindow(
ConversationItemsControl.ItemsSource = _conversationItems;
Closed += OnWindowClosed;
+ Activated += OnWindowActivated;
var dispatcherQueue = DispatcherQueue.GetForCurrentThread();
if (dispatcherQueue != null)
@@ -217,6 +222,19 @@ private void OnTextSizeSelectionChanged(object sender, SelectionChangedEventArgs
_settings.Save(logSuccess: false);
}
+ private void OnFloatingEnabledChanged(object sender, RoutedEventArgs e)
+ {
+ if (_suppressSettingsEvents)
+ {
+ return;
+ }
+
+ var enabled = FloatingEnabledCheckBox.IsChecked == true;
+ _settings.VoiceRepeaterWindow.FloatingEnabled = enabled;
+ IsAlwaysOnTop = enabled;
+ _settings.Save(logSuccess: false);
+ }
+
private void OnOpenVoiceStatusClick(object sender, RoutedEventArgs e)
{
OpenVoiceStatusRequested?.Invoke(this, EventArgs.Empty);
@@ -239,14 +257,35 @@ private void OnWindowClosed(object sender, WindowEventArgs e)
AppWindow.Changed -= OnAppWindowChanged;
}
- SaveWindowPlacement();
+ Activated -= OnWindowActivated;
+ FlushWindowPlacement();
IsClosed = true;
}
+ private void OnWindowActivated(object sender, WindowActivatedEventArgs args)
+ {
+ if (!_initialPlacementPending)
+ {
+ return;
+ }
+
+ _initialPlacementPending = false;
+ ApplyStoredWindowPlacement();
+
+ var dispatcherQueue = DispatcherQueue.GetForCurrentThread();
+ _ = dispatcherQueue?.TryEnqueue(() => _suppressPlacementSave = false);
+ }
+
private void OnAppWindowChanged(AppWindow sender, AppWindowChangedEventArgs args)
{
+ if (_suppressPlacementSave)
+ {
+ return;
+ }
+
if (args.DidPositionChange || args.DidSizeChange)
{
+ _placementDirty = true;
_layoutSaveTimer?.Stop();
_layoutSaveTimer?.Start();
}
@@ -272,7 +311,7 @@ private void ApplyStatus(VoiceStatusInfo status)
var paused = status.State == VoiceRuntimeState.Paused;
PauseResumeButton.IsEnabled = !_controlActionInFlight && status.Mode != VoiceActivationMode.Off;
- PauseResumeIcon.Glyph = paused ? "\uE768" : "\uE769";
+ PauseResumeIcon.Symbol = paused ? Symbol.Play : Symbol.Pause;
ToolTipService.SetToolTip(
PauseResumeButton,
paused ? "Resume voice mode" : "Pause voice mode");
@@ -282,21 +321,40 @@ private void ApplyStatus(VoiceStatusInfo status)
private void ApplyStoredWindowPlacement()
{
+ if (AppWindow is null)
+ {
+ return;
+ }
+
var prefs = _settings.VoiceRepeaterWindow;
- var width = prefs.Width.GetValueOrDefault(DefaultWidth);
- var height = prefs.Height.GetValueOrDefault(DefaultHeight);
+ var width = prefs.HasSavedPlacement
+ ? prefs.Width.GetValueOrDefault(DefaultWidth)
+ : DefaultWidth;
+ var height = prefs.HasSavedPlacement
+ ? prefs.Height.GetValueOrDefault(DefaultHeight)
+ : DefaultHeight;
+ var clampedWidth = Math.Max(width, 320);
+ var clampedHeight = Math.Max(height, 150);
+
+ IsAlwaysOnTop = prefs.FloatingEnabled;
- this.SetWindowSize(
- Math.Max(width, 320),
- Math.Max(height, 150));
+ var targetRect = prefs.HasSavedPlacement && prefs.X.HasValue && prefs.Y.HasValue
+ ? new RectInt32(prefs.X.Value, prefs.Y.Value, clampedWidth, clampedHeight)
+ : GetDefaultAnchorRect(clampedWidth, clampedHeight);
- if (prefs.X.HasValue && prefs.Y.HasValue)
+ if (!IsPlacementVisible(targetRect))
{
- AppWindow.Move(new PointInt32(prefs.X.Value, prefs.Y.Value));
+ targetRect = GetDefaultAnchorRect(clampedWidth, clampedHeight);
}
- else
+
+ try
{
- this.CenterOnScreen();
+ AppWindow.MoveAndResize(targetRect);
+ }
+ catch
+ {
+ this.SetWindowSize(targetRect.Width, targetRect.Height);
+ AppWindow.Move(new PointInt32(targetRect.X, targetRect.Y));
}
}
@@ -323,6 +381,7 @@ private void ApplyViewSettings()
}
AutoScrollCheckBox.IsChecked = _autoScrollEnabled;
+ FloatingEnabledCheckBox.IsChecked = _settings.VoiceRepeaterWindow.FloatingEnabled;
SelectTextSizeItem(_messageFontSize);
}
finally
@@ -333,7 +392,7 @@ private void ApplyViewSettings()
private void SaveWindowPlacement()
{
- if (IsClosed || AppWindow is null)
+ if (IsClosed || AppWindow is null || _suppressPlacementSave)
{
return;
}
@@ -344,7 +403,44 @@ private void SaveWindowPlacement()
_settings.VoiceRepeaterWindow.Height = size.Height;
_settings.VoiceRepeaterWindow.X = position.X;
_settings.VoiceRepeaterWindow.Y = position.Y;
+ _settings.VoiceRepeaterWindow.HasSavedPlacement = true;
_settings.Save(logSuccess: false);
+ _placementDirty = false;
+ }
+
+ private void FlushWindowPlacement()
+ {
+ if (_placementDirty || !IsClosed)
+ {
+ SaveWindowPlacement();
+ }
+ }
+
+ private RectInt32 GetDefaultAnchorRect(int width, int height)
+ {
+ var displayArea = DisplayArea.Primary;
+ var x = displayArea.WorkArea.X + DefaultMargin;
+ var y = displayArea.WorkArea.Y + Math.Max(DefaultMargin, displayArea.WorkArea.Height - height - DefaultMargin);
+ return new RectInt32(x, y, width, height);
+ }
+
+ private static bool IsPlacementVisible(RectInt32 rect)
+ {
+ try
+ {
+ var displayArea = DisplayArea.GetFromRect(rect, DisplayAreaFallback.Nearest);
+ var workArea = displayArea.WorkArea;
+ return rect.Width > 0 &&
+ rect.Height > 0 &&
+ rect.X < workArea.X + workArea.Width &&
+ rect.X + rect.Width > workArea.X &&
+ rect.Y < workArea.Y + workArea.Height &&
+ rect.Y + rect.Height > workArea.Y;
+ }
+ catch
+ {
+ return false;
+ }
}
private void SelectTextSizeItem(double size)
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomBridge.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomBridge.cs
index b331cb3..fbe5845 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomBridge.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomBridge.cs
@@ -7,14 +7,7 @@ internal static class WebChatVoiceDomBridge
public const string DocumentCreatedScript = """
(() => {
const isVisible = (el) => !!el && !(el.disabled === true) && el.getClientRects().length > 0;
- const memoryPattern = /<relevant-memories>[\s\S]*?<\/relevant-memories>\s*/gi;
let desiredDraft = '';
- let stripInjectedMemories = true;
-
- const sanitize = (value) => {
- const text = typeof value === 'string' ? value : '';
- return stripInjectedMemories ? text.replace(memoryPattern, '').trimStart() : text;
- };
const findComposer = () => {
const candidates = Array.from(document.querySelectorAll('textarea, input[type="text"], [contenteditable="true"], [contenteditable="plaintext-only"]'));
@@ -22,23 +15,23 @@ internal static class WebChatVoiceDomBridge
};
const setElementValue = (el, value) => {
- const sanitized = sanitize(value);
+ const text = typeof value === 'string' ? value : '';
if ('value' in el) {
const proto = el.tagName === 'TEXTAREA' ? HTMLTextAreaElement.prototype : HTMLInputElement.prototype;
const descriptor = Object.getOwnPropertyDescriptor(proto, 'value');
if (descriptor && descriptor.set) {
- descriptor.set.call(el, sanitized);
+ descriptor.set.call(el, text);
} else {
- el.value = sanitized;
+ el.value = text;
}
- el.dispatchEvent(new InputEvent('input', { bubbles: true, data: sanitized, inputType: 'insertText' }));
+ el.dispatchEvent(new InputEvent('input', { bubbles: true, data: text, inputType: 'insertText' }));
el.dispatchEvent(new Event('change', { bubbles: true }));
return;
}
if (el.isContentEditable) {
- el.textContent = sanitized;
- el.dispatchEvent(new InputEvent('input', { bubbles: true, data: sanitized, inputType: 'insertText' }));
+ el.textContent = text;
+ el.dispatchEvent(new InputEvent('input', { bubbles: true, data: text, inputType: 'insertText' }));
el.dispatchEvent(new Event('change', { bubbles: true }));
}
};
@@ -76,10 +69,6 @@ internal static class WebChatVoiceDomBridge
desiredDraft = text || '';
return applyDraftIfPossible();
},
- setStripInjectedMemories(enabled) {
- stripInjectedMemories = !!enabled;
- return applyDraftIfPossible();
- },
clearDraft() {
desiredDraft = '';
return applyDraftIfPossible();
@@ -88,16 +77,10 @@ internal static class WebChatVoiceDomBridge
clearLegacyTurnsHost();
return true;
}
- };
+ };
})();
""";
- public static string BuildSetStripInjectedMemoriesScript(bool enabled)
- {
- var value = enabled ? "true" : "false";
- return $"window.__openClawTrayVoice?.setStripInjectedMemories?.({value});";
- }
-
public static string BuildSetDraftScript(string? text)
{
if (string.IsNullOrWhiteSpace(text))
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomState.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomState.cs
index e8f77f6..59cdcfe 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomState.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatVoiceDomState.cs
@@ -2,22 +2,14 @@ namespace OpenClawTray.Windows;
internal sealed class WebChatVoiceDomState
{
- public WebChatVoiceDomState(bool stripInjectedMemories)
+ public WebChatVoiceDomState()
{
- StripInjectedMemories = stripInjectedMemories;
}
- public bool StripInjectedMemories { get; private set; }
-
public string PendingDraft { get; private set; } = string.Empty;
public void SetDraft(string? text, bool clear)
{
PendingDraft = clear ? string.Empty : (text ?? string.Empty);
}
-
- public void SetStripInjectedMemories(bool enabled)
- {
- StripInjectedMemories = enabled;
- }
}
diff --git a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
index 93e62fc..1b0f000 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/WebChatWindow.xaml.cs
@@ -27,12 +27,12 @@ public sealed partial class WebChatWindow : WindowEx
public bool IsClosed { get; private set; }
- public WebChatWindow(string gatewayUrl, string token, bool stripInjectedMemories)
+ public WebChatWindow(string gatewayUrl, string token)
{
Logger.Info($"WebChatWindow: Constructor called, gateway={gatewayUrl}");
_gatewayUrl = gatewayUrl;
_token = token;
- _voiceDomState = new WebChatVoiceDomState(stripInjectedMemories);
+ _voiceDomState = new WebChatVoiceDomState();
InitializeComponent();
@@ -281,12 +281,6 @@ public async Task AppendVoiceConversationTurnAsync(VoiceConversationTurnEventArg
await Task.CompletedTask;
}
- public async Task SetStripInjectedMemoriesEnabledAsync(bool enabled)
- {
- _voiceDomState.SetStripInjectedMemories(enabled);
- await RefreshTrayVoiceDomStateAsync();
- }
-
private async Task RefreshTrayVoiceDomStateAsync()
{
if (WebView.CoreWebView2 == null || !_voiceDomReady || IsClosed)
@@ -296,8 +290,6 @@ private async Task RefreshTrayVoiceDomStateAsync()
try
{
- await WebView.CoreWebView2.ExecuteScriptAsync(
- WebChatVoiceDomBridge.BuildSetStripInjectedMemoriesScript(_voiceDomState.StripInjectedMemories));
await WebView.CoreWebView2.ExecuteScriptAsync(
WebChatVoiceDomBridge.BuildSetDraftScript(_voiceDomState.PendingDraft));
await WebView.CoreWebView2.ExecuteScriptAsync(WebChatVoiceDomBridge.ClearLegacyTurnsScript);
From a78764006ba55b769670880c20d2ca89bb5a6c5d Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 30 Mar 2026 12:12:19 +0100
Subject: [PATCH 81/83] Add voice mode feature icon asset
---
README.md | 3 ++-
.../Assets/voice-mode-feature.png | Bin 0 -> 1048 bytes
2 files changed, 2 insertions(+), 1 deletion(-)
create mode 100644 src/OpenClaw.Tray.WinUI/Assets/voice-mode-feature.png
diff --git a/README.md b/README.md
index 9ea5ef5..ef256c4 100644
--- a/README.md
+++ b/README.md
@@ -84,7 +84,7 @@ Modern Windows 11-style system tray companion that connects to your local OpenCl
- 🚀 **Auto-start** - Launch with Windows
- ⚙️ **Settings** - Full configuration dialog
- 🎯 **First-run experience** - Welcome dialog guides new users
-- 🦞🎧 **Voice Mode (new)** - Talk to your Claw via your Windows node
+- <img src="src/OpenClaw.Tray.WinUI/Assets/voice-mode-feature.png" alt="Voice Mode" width="20" height="20" /> **Voice Mode (new)** - Talk to your Claw via your Windows node
### Menu Sections
- **Status** - Gateway connection status with click-to-view details
@@ -115,6 +115,7 @@ Comparing against [openclaw-menubar](https://github.com/magimetal/openclaw-menub
| Refresh | ✅ | ✅ | Auto-refresh on menu open |
| Launch at Login | ✅ | ✅ | |
| Notifications toggle | ✅ | ✅ | |
+| Voice Mode | ✅ | 🟡 | Talk Mode implemented (half-duplex); WakeWord, Interrupt, etc. in progress |
### Windows-Only Features
diff --git a/src/OpenClaw.Tray.WinUI/Assets/voice-mode-feature.png b/src/OpenClaw.Tray.WinUI/Assets/voice-mode-feature.png
new file mode 100644
index 0000000000000000000000000000000000000000..04c239b36cb922d8f70dc38ae294927e324c9aa3
GIT binary patch
literal 1048
zcmV+z1n2vSP)<h;3K|Lk000e1NJLTq000yK000yS1^@s6jfou%00001b5ch_0Itp)
z=>Px#1ZP1_K>z@;j|==^1poj532;bRa{vGi!vFvd!vV){sAK>D1FcC!K~y+TwUAqE
zQ*|82{|KAdQ(cGGHCBPG+j>9kdO4>(r{}cm*6ZorwYzjO9!yL$3o!$Ohz3NBi7s3W
zs2L9+5DjXO*nr}Q3f?AhULyLi_#ol~8lM~(@O;f_wMmal^u<r|O@8_Pf8Wo;pOAk+
zmSw71eYf-Y@wbftwCeZ&1b}6mAAKU3$d(VWV(fEPjQ_w2@e8aFn~%f`uW#76r3|1|
z-CL3)N8VUnD2yJpdc>=1oIKXj9C}?GOfDXsGzVp8b_<YlBPVw6-e*(yn>hK}$=i~d
z(rmYr1H*=~Ari%tK@N|3-SSpXpS*)(v3W3xFh77UM*xF~!nta-+Nt)`IPlOIRW4Ot
zvh)Q(a|0JY8z1@+69r&Y0@6Y~V-hf)!T0M*3;u}ooyo~T%9M_so}SjJeWc8Z8)>fy
zql>@xL8bJoJm|iRS6_zw<vfsJfrKCOS6=|H9E9BOT>R$Vp-Wa@=r@};gt3^ksY!M_
zA=@_v`Je7fr7;zaUzLXwe<01{*^GqoY#zu6wNx^BOlKq@6NFrhTzg8A7AKMt9`FY)
zb`Y|yA)FDsXTM&b!%R7cry~(uNQpopjAI@fc3HZBTnNa9u+!XyH@!9>AI7DWi02{^
z>@8(+Ze14p1<#p=aAAG=^vpde=&cM^Ss3i#IUpz2iYW908L=K^R(3fN&jfkgVr6lM
z#fN=`6c#3=*@loID-QP=y07%Og6OdO@i6z7@Nbbic@a|_SG!d@ZiDfPkE}S<5Oxrv
zv06NPY;FO18;dPGUn~7X)DVX72+!AUm5#;a-laGiLIsn9eO!6$aq$RyxTA70bOysQ
zpg7Q+;TYt+@JZMmc&ppR*Mz;ae`}LW!Qx=HIecN*y<r$Tt>4ab_WR}0#Dj7vI@y%b
z=oBVmoWah%NPEMeef==g{_ScXF*_Kq+3ByEUHmDRe)a6Fgj5J2c0!0-6^0OEFuA#i
z#Tj_VLI+Ou(u|<?H8DU4m1@(SEw<kbN!Q+#Li;Kt+jJj#3Ast#Tavp;o6g;$IsZOG
zV=*n@=LCme;`J!=<+OnJ7#gk?&H1&Yy-nS3!mrixtVa8BUkf!Xr-M$pl6V0labRe$
zo{>1@vE<(0(?O@Jg&JWs+K+r%EvNQ0U@6MV+;C&<gYR12TAaL7KX=$+#(rB*Ey_G;
z(0{sSd22zrQEmeiWl?*U#IK>Ar(0Lf8JD%3=)SHVWu~uRImc+Ud(`*;8-D_QZr>QY
SdWmQN0000<MNUMnLSTaZT<#J8
literal 0
HcmV?d00001
From 67771c2bc31a6f0d050eb84a996a7104e46e50c0 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 30 Mar 2026 17:05:11 +0100
Subject: [PATCH 82/83] Address remaining voice review comments
---
.../Voice/VoiceProviderCatalogService.cs | 4 ++--
.../Windows/SettingsWindow.xaml.cs | 19 ++++++++++++++++---
.../VoiceProviderCatalogServiceTests.cs | 2 +-
3 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
index 41e26dd..8bc4d0a 100644
--- a/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
+++ b/src/OpenClaw.Tray.WinUI/Services/Voice/VoiceProviderCatalogService.cs
@@ -118,9 +118,9 @@ private static VoiceProviderCatalog NormalizeCatalog(VoiceProviderCatalog catalo
};
}
- private static List<VoiceProviderOption> NormalizeProviders(List<VoiceProviderOption> providers)
+ private static List<VoiceProviderOption> NormalizeProviders(List<VoiceProviderOption>? providers)
{
- return providers
+ return (providers ?? [])
.Where(p => !string.IsNullOrWhiteSpace(p.Id))
.Select(Clone)
.Where(p => p.Enabled || p.VisibleInSettings)
diff --git a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
index eb686a8..e0093b2 100644
--- a/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
+++ b/src/OpenClaw.Tray.WinUI/Windows/SettingsWindow.xaml.cs
@@ -72,7 +72,7 @@ private void LoadSettings()
NodeModeToggle.IsOn = _settings.EnableNodeMode;
}
- private async Task SaveSettingsAsync()
+ private async Task<bool> SaveSettingsAsync()
{
_settings.GatewayUrl = GatewayUrlTextBox.Text.Trim();
_settings.Token = TokenTextBox.Text.Trim();
@@ -95,10 +95,20 @@ private async Task SaveSettingsAsync()
_settings.NotifyInfo = NotifyInfoCb.IsChecked ?? true;
_settings.EnableNodeMode = NodeModeToggle.IsOn;
- await VoiceSettingsPanel.ApplyAsync(_settings);
+ try
+ {
+ await VoiceSettingsPanel.ApplyAsync(_settings);
+ }
+ catch (Exception ex)
+ {
+ Logger.Error($"[Settings] Failed to apply voice settings: {ex.Message}");
+ StatusLabel.Text = $"❌ Failed to apply voice settings: {ex.Message}";
+ return false;
+ }
_settings.Save();
AutoStartManager.SetAutoStart(_settings.AutoStart);
+ return true;
}
private async void OnTestConnection(object sender, RoutedEventArgs e)
@@ -200,7 +210,10 @@ private async void OnSave(object sender, RoutedEventArgs e)
var oldGateway = _settings.GatewayUrl;
var oldAutoStart = _settings.AutoStart;
var oldNodeMode = _settings.EnableNodeMode;
- await SaveSettingsAsync();
+ if (!await SaveSettingsAsync())
+ {
+ return;
+ }
if (!string.Equals(oldGateway, _settings.GatewayUrl, StringComparison.Ordinal))
{
diff --git a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
index cac28f6..f6ff8ca 100644
--- a/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
+++ b/tests/OpenClaw.Tray.Tests/VoiceProviderCatalogServiceTests.cs
@@ -62,7 +62,7 @@ public void SupportsSpeechToTextRuntime_ReturnsTrueOnlyForWindowsMediaRoute()
}
[Fact]
- public void SupportsTextToSpeechRuntime_ReturnsTrueForMiniMaxOnlyWhenImplemented()
+ public void SupportsTextToSpeechRuntime_ReturnsTrueForImplementedProviders()
{
Assert.True(VoiceProviderCatalogService.SupportsTextToSpeechRuntime(VoiceProviderIds.Windows));
Assert.True(VoiceProviderCatalogService.SupportsTextToSpeechRuntime(VoiceProviderIds.MiniMax));
From 4756563dc3290b184393b05ba6f1c5df2c353f20 Mon Sep 17 00:00:00 2001
From: Nich Overend <nich@nixnet.com>
Date: Mon, 30 Mar 2026 18:16:10 +0100
Subject: [PATCH 83/83] Tweak voice mode README credit wording
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index ef256c4..1caf343 100644
--- a/README.md
+++ b/README.md
@@ -250,7 +250,7 @@ OpenClaw registers the `openclaw://` URL scheme for automation and integration:
Deep links work even when Molty is already running - they're forwarded via IPC.
### Voice Mode
-*built by NichUK and his colleagues @codex and @copilot*
+*contributed by NichUK and his colleagues @codex and @copilot*
Currently supports Talk Mode - Always on talk to your Claw! Wakeword and PTT modes coming soon
- Uses internal Windows STT (cloud providers coming soon)