feat(teleop): hosted Go2 one-session module + robot commands, multicam, battery, LiveKit#2562
Draft
ruthwikdasyam wants to merge 53 commits into
Draft
feat(teleop): hosted Go2 one-session module + robot commands, multicam, battery, LiveKit#2562ruthwikdasyam wants to merge 53 commits into
ruthwikdasyam wants to merge 53 commits into
Conversation
…ailure; surface heartbeat errors
❌ 3 Tests Failed:
View the top 3 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
If POST /sessions returned non-201 or wait_connected raised, _connect left self._pc and self._http set — the next start() overwrote them without cleanup, orphaning the PC and httpx connection pool. The broker's new partial-failure 502s make this hit more often. Wrap the body in try/except; on any failure call _disconnect (defensive about None refs) before re-raising. Also cap the error body log at 200 chars — SDP carries short-lived ICE ufrag/pwd.
If the broker force-deletes the session (delete_session full teardown, key revocation) the heartbeat used to log a warning every tick at heartbeat_hz forever. Track a consecutive 401/404 streak and break out of the loop after 5. _heartbeat_once now returns the HTTP status (None on skip) so the loop can see it.
The connectionstatechange handler only logged at INFO and never reacted to failed/closed. Combined with the heartbeat loop that used to spin through 404s indefinitely, a robot with a dead PC kept publishing frames into the void with no visible signal. Log terminal states at ERROR so the orchestrator (and operators tailing logs) see them.
publish() rechecks readyState under the lock then schedules ch.send via call_soon_threadsafe. Between the check and the loop tick the channel can transition to closing (operator pagehide → broker reissues SCTP ids → heartbeat closes), so ch.send raises InvalidStateError and bubbles to the default asyncio exception handler. Wrap the send in a closure that rechecks readyState on the loop thread and swallows the exception at debug level.
state_reliable's message handler calls _maybe_answer_ping, which looks up state_reliable_back in _dcs. With the previous insert order, a clock-sync ping that arrived between createDataChannel(state_reliable) and the _dcs[state_reliable] = ch insert would find no state_reliable_back yet and drop the pong with a warning. Reorder the heartbeat's ids dict so state_reliable_back is opened first. Python 3.7+ preserves insertion order on dict iteration, so the for-loop processes it before state_reliable.
set_rage_mode applied a 2s settle only when enable=True. On disable, SwitchJoystick(False) fired immediately after RAGEMODE(False) — the FSM was still mid-transition out of Rage, the joystick toggle got silently no-op'd, and the robot stayed in Rage. The user's only recovery was disconnect + relaunch. Apply the settle in both directions.
_rage_active was initialised to False in __init__ and only ever flipped by the operator's set_mode click. If the firmware was left in Rage by a previous session, the short-circuit at "want_rage == self._rage_active" returned "already in right FSM" on every Normal/High click and the user was locked in Rage until disconnect + relaunch. Force set_rage_mode(False) at start. The base GO2Connection only forces RAGE *on* (when config.mode says so); we mirror that with an explicit off so the tracked state matches the firmware.
cmd_unreliable is lossy + unordered SCTP — a late packet can clobber a fresher one. The reported symptom: press Q (left strafe), release, press W (forward) → if Q's twist arrives at the robot after W's, the robot keeps strafing left. Safety-relevant. Override Go2HostedConnection.move() to: - drop if operator-stamped ts is >0.5s old (existing staleness intent) - drop if ts <= the highest ts seen on this stream (ordering) Then delegate to super().move(). ts comes from TwistStamped's Timestamped mixin (LCM header.stamp). _last_cmd_ts is initialised to 0.0 in __init__; first valid cmd always passes.
set_rage_mode(False) ended with SwitchJoystick(False), which puts the robot in a static standup posture — the CoM shifts but the FSM won't accept velocity commands (observed: exit Rage → try WASD → nothing). Follow SwitchJoystick(False) with BalanceStand + RecoveryStand on the disable path. BalanceStand alone isn't enough on real hardware; RecoveryStand reliably lands the FSM in the state that consumes WIRELESS_CONTROLLER twists.
Stand/Drive button did standup → sleep → BalanceStand and acked ok, but on real hardware after transitions from Sit / Rage / StandDown the FSM wasn't accepting velocity — WASD did nothing. Only a separate Recovery press worked. Append sport_command(RecoveryStand) after BalanceStand so the combined Stand/Drive posture lands the robot in the accepts-velocity state directly.
…c standup" Unsure whether SwitchJoystick(False) actually leaves the FSM in the wrong state or the added BalanceStand + RecoveryStand nudges it into a different unintended state. Roll back for now; keep the symmetric 2s settle from the earlier commit. This reverts commit 6d5f722.
…drop dead stats, add hosted-connection tests
_dispatch fanned out to subscribers but never answered pings, so on
LiveKit robots the operator's state.bestRttMs stayed Infinity,
clockOffsetMs stayed 0, and command timestamps went on the wire in
raw operator wall clock — the robot's move() staleness check then
compared time.time() to an operator-clock ts and either dropped
everything or accepted stale packets depending on host clock skew.
Port _maybe_answer_ping from BrokerProvider verbatim: on
topic=state_reliable, decode the ping and publish
{type:"pong", client_ts:<echo>, robot_ts: time.time()} on
state_reliable_back, reliable, before subscriber fanout.
…ency stamp reaches operator
The comment above the cmd_vel auto-stop timer claimed 0.5 seconds while cmd_vel_timeout is 0.2. It's the safety-critical constant for hosted teleop (halts the base when the operator link drops mid-drive), so the comment must not lie about it. (HARDENING_PLAN B4 in dimensional-teleop.)
Every sport_cmd/set_mode/obstacle_avoidance spawned an unbounded daemon thread, and _rage_active was read on the callback thread while a previous toggle's thread wrote it (rapid toggles could double-fire the firmware). - single-worker executor (repo pattern, cf. utils/threadpool and drake_world): commands run strictly in order — the rage check moves inside the serialized task, so the race is gone by construction - bounded backlog: past 4 pending, commands are busy-rejected with ack ok=false instead of piling up threads - Damp is the E-STOP: it bypasses the queue on a dedicated thread so a stop never waits behind a ~3.3s StandReady holding the worker Tests: ordering, backlog rejection, and the Damp bypass; executor reaper fixture keeps the repo thread-leak check green. (HARDENING_PLAN B1+B3 in dimensional-teleop.)
Transport/UI duplicates of a nonce'd command (sport_cmd / set_mode / obstacle_avoidance) within a 10s window re-ack the prior result instead of re-executing; in-flight duplicates are dropped (the original acks). Busy rejections and shutdown races unwind the reservation so genuine retries still run. Scope: JSON action commands only — cmd_vel twists carry no nonce and are already guarded by the monotonic-timestamp drop in move(). TTL is short on purpose: browser nonces restart at 1 per operator session. (HARDENING_PLAN B2 in dimensional-teleop.)
Mirrors BrokerProvider's terminal condition: a revoked key or force- deleted session otherwise retries + log-floods at heartbeat_hz forever. Any other status (or network error) resets the streak. (HARDENING_PLAN B5 in dimensional-teleop.)
Two safety gaps from the teleop audit: E-STOP was just an allow-listed
Damp (no latch — twists kept flowing after it), and nothing reacted when
the operator's command plane vanished (in-flight commands ran to
completion; only the 0.2s cmd_vel deadman covered the base).
- {"type":"estop"}: latch FIRST (move() refuses twists instantly,
non-urgent commands rejected), then Damp on the urgent path — never
queued behind a slow StandReady. {"type":"estop_clear"} re-arms
without moving the robot.
- operator-lost: both providers inject a synthetic
{"type":"operator_lost"} to state_reliable subscribers — CF detects
the state channel id going away on heartbeat, LiveKit uses
participant_disconnected. The module zeros the base (stop_movement),
clears the per-session nonce cache, and optionally Damps
(config.damp_on_operator_lost, off by default: a WiFi blip shouldn't
drop the robot mid-patrol).
Tests: latch/refuse/re-arm, clear-does-not-move, operator-lost stop +
nonce reset, config-gated damp.
(HARDENING_PLAN A2 in dimensional-teleop.)
robot_telemetry gains a state dict — posture (tracked off successful posture commands; gestures don't touch it), rage, obstacle avoidance, camera selection, and the E-STOP latch — so a (re)connecting operator's cockpit seeds from reality instead of optimistic defaults (it assumed StandReady + OA on, and couldn't know rage/cams at all). Telemetry now always publishes (state is always meaningful, not just when cmd stats/battery exist); payload built by _telemetry_payload() for testability. Roadmap Phase 2 item. (HARDENING_PLAN A3 in dimensional-teleop.)
…dec pref, LiveKit encoding The video pipeline ran wide open (aiortc defaults, source rate/resolution); congestion surfaced as encoder drops and freezes instead of a bounded stream. All knobs are opt-in (0/empty = today's behavior): - Go2HostedConnectionConfig.video_max_fps / video_max_width: cap at the mux — fps gate before any composite work, downscale before the latency stamp so its 16px cells stay decodable. Holds on both transports. - BrokerConfig.video_codec: reorder aiortc codec preferences (e.g. h264 first) on the video transceiver; unknown codec falls back to defaults. - LiveKitBrokerConfig.video_max_bitrate_bps / video_max_fps → TrackPublishOptions.video_encoding on the published track. Tests: fps cap, width downscale (aspect preserved), zero-config passthrough. (HARDENING_PLAN E1 in dimensional-teleop.)
- set_light(on) on the Go2 connection protocol: VUI brightness api
(api_id 1005, level 10/0 — same service color() already drives);
no-op stubs on replay/mujoco/dimsim
- hosted: {"type":"light", "enabled", "nonce"} handled like the
obstacle-avoidance toggle (serialized worker, acked); _light_on
tracked and pushed in robot_telemetry.state so the cockpit reconciles
Tests: toggle updates state + acks; failed RPC leaves state unchanged.
set_light now takes the firmware-native VUI level (0-10, clamped); the
hosted handler accepts {"type":"light", "brightness": 0..1} (validated,
NaN-rejected, clamped) and maps to levels. The already-deployed toggle's
{"enabled": bool} still works — mapped to full/zero brightness — so the
live frontend keeps functioning until the slider UI ships.
robot_telemetry.state.light is now the 0..1 float.
Tests: level mapping, legacy bool compat, clamp + malformed rejection,
failed-RPC state retention.
…t disabled Root cause: set_rage_mode ended with SwitchJoystick(enable), so turning rage OFF disabled firmware joystick listening — and the hosted module force-resets rage OFF at every start(). move()'s WIRELESS_CONTROLLER stick emulation was then silently ignored: keys pressed, robot deaf. - switch_joystick(enable) extracted as a proper connection method (protocol + sim stubs); set_rage_mode now always re-enables it — rage only changes the speed envelope, never joystick listening - Stand/Drive combo reordered to END drive-ready: standup → RecoveryStand (recover from Sit/Damp) → BalanceStand → SwitchJoystick(True). The old order ended in RecoveryStand, which is not a stick-accepting FSM state. Test pins the exact call order and the joystick re-enable.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
feat/webrtc-transportshipped hosted teleop as a pure transport swap, but the Go2 driver and the hosted state plane lived in separate modules → separate processes. Because the broker provider is a per-process singleton andGO2Connectionisdedicated_worker=True, the state-plane module opened a second Cloudflare session the operator never saw — so robot→operator telemetry (battery, command-plane stats) silently never arrived. The transport path also had no robot-command channel (posture/actions), no multi-camera support, and only one SFU backend (Cloudflare).Solution
Consolidate the hosted control plane into the Go2 driver as one
dedicated_workermodule (Go2HostedConnection(GO2Connection)) so everything shares one CF session, then build the operator-facing features on top:dimos/robot/unitree/go2/hosted_connection.py. Folds state-plane + camera mux into the driver's process; replaces the standaloneTeleopStateBridge(removed).sport_cmdonstate_reliable, acked back): posture (StandReady combo / Sit / StandDown / RecoveryStand), actions (Hello / Stretch / FrontPounce / FrontJump), and an E-stop (Damp) out-of-band of the twist stream. Newsport_command(api_id)+ bidirectionalset_rage_mode(enable)on the connection.set_mode→ Normal / High (browser scale) + Rage (firmware FSM toggle).camera_select), graceful degrade to Go2-only. Newteleop-hosted-go2-multicamblueprint.get_battery_soc()+_on_lowstate(cachesbms_state.soc).cmd_ackonstate_reliable_back; restored inline pong answering (_maybe_answer_ping) so clock-sync RTT works on the transport path.LiveKitBrokerProvider+LiveKitTransport/LiveKitVideoTransportas a drop-in alternative SFU to Cloudflare; newteleop-hosted-go2-livekitblueprint.Removed the
teleop-state-bridgeandhosted-teleop-recorderblueprints (the latter only set a path; compose the genericteleop-recorderinstead).How to Test
Connect from the operator web client: drive (WASD → twist), fire posture/actions, watch live battery + command-plane telemetry in the HUD. For the two-camera path (needs a RealSense on the dimos host) use
teleop-hosted-go2-multicam; for the LiveKit SFU useteleop-hosted-go2-livekit.Contributor License Agreement