Initial xArm6 WorldBelief perception stack#2665

Open
jhengyilin wants to merge 1 commit into
dimensionalOS:mainfrom
jhengyilin:jhengyi/perception-worldbelief-clean-0625

@jhengyilin


WorldBelief live object identity for xArm6 manipulation

flowchart TB
    camera([RealSense RGB-D camera])

    detector["YOLO-E prompt detector<br/><i>prompt mode, tracker disabled in xArm blueprint</i>"]
    osr["ObjectSceneRegistration<br/><i>RGB-D objects + CLIP/DINO crop embeddings</i>"]
    wb["WorldBelief<br/><i>stable IDs, support windows, re-acquisition</i>"]
    rerun["Rerun<br/><i>annotated image + 3D workspace</i>"]
    manip["PickAndPlaceModule<br/><i>reads present objects</i>"]
    recorder["XArm6WorldBeliefRecorder<br/><i>records replay-critical streams</i>"]

    subgraph m2 ["memory2"]
        direction TB
        subgraph session ["per-run recording DB"]
            direction LR
            color[(color_image)]
            depth[(depth_image)]
            info[(camera_info)]
            det2d[(detections_2d)]
            det3d[(detections_3d)]
            audit[(worldbelief_audit)]
        end
        subgraph history ["worldbelief_history.db"]
            direction LR
            object_events[(object evidence)]
            semantic_vec[(semantic vectors)]
            visual_vec[(visual vectors)]
        end
    end

    camera --> detector
    detector --> osr
    camera --> osr
    osr -->|"frame objects"| wb
    osr -->|"annotated image + current-frame pointcloud"| rerun
    wb -->|"present objects"| manip
    wb -->|"detections_3d + audit"| rerun
    wb ==> object_events
    wb ==> semantic_vec
    wb ==> visual_vec
    object_events -. "rehydrate compact identity state on restart" .-> wb
    camera ==> recorder
    osr ==> recorder
    wb ==> recorder
    recorder ==> color
    recorder ==> depth
    recorder ==> info
    recorder ==> det2d
    recorder ==> det3d
    recorder ==> audit

    classDef stream fill:#fef3c7,stroke:#d97706,stroke-width:2px
    classDef module fill:#dbeafe,stroke:#2563eb,stroke-width:2px
    classDef memory fill:#dcfce7,stroke:#16a34a,stroke-width:2px
    classDef external fill:#f3f4f6,stroke:#6b7280,stroke-width:1px
    class color,depth,info,det2d,det3d,audit,object_events,semantic_vec,visual_vec stream
    class detector,osr,wb,rerun,manip,recorder module
    class m2,session,history memory
    class camera external

What this unlocks

This PR adds a live object identity layer for the xArm6 perception/manipulation stack. The detector still sees frame-local masks; WorldBelief turns those observations into stable workspace objects that pick/place and visualization can trust.

| Capability | What changed | Why it matters |
| --- | --- | --- |
| Stable live object IDs | WorldBelief associates current 3D observations against maintained identity state using geometry, labels, CLIP, DINO, support windows, and re-acquisition policy. | A can should keep its identity while it moves, disappears briefly, or is seen again after camera motion. |
| Manipulation-facing present set | OSR publishes WorldBelief present_objects to the objects port and Detection3DArray. | Pick/place consumes a filtered workspace state, not every noisy frame detection. |
| Cross-session identity seed | WorldBelief writes compact object evidence and vector evidence into a stable Memory2 history DB, then rehydrates maintained state on restart. | The next process starts with prior identity evidence instead of a blank identity table. |
| Cleaner Rerun view | Blueprint opens Rerun with annotated image on the left and 3D workspace on the right. Current-frame pointclouds are used for visual blobs. | The display reflects both live camera evidence and trusted WorldBelief state without stale pointcloud trails. |
| Replay/debug evidence | A dedicated recorder writes a fresh per-run Memory2 DB for color/depth/camera info/detections/audit streams. | We can inspect what the stack saw and what WorldBelief decided without mixing sessions. |

How it works - walkthrough

| t | Event | What happens |
| --- | --- | --- |
| 0 | Blueprint starts | The xArm6 WorldBelief blueprint wires RealSense, YOLO-E, OSR, WorldBelief, Rerun, recorder, and pick/place. The stable history DB is opened if configured. |
| 1 | Camera sees objects | YOLO-E produces prompt-mode masks. OSR uses color, depth, camera info, and TF to build 3D Object observations. |
| 2 | Appearance evidence is attached | OSR crops each object and attaches CLIP semantic embeddings and DINO visual embeddings. |
| 3 | WorldBelief updates | The identity engine matches observations to existing IDs or creates candidates. Objects become present only after enough recent support. |
| 4 | Manipulation reads state | present_objects are published to the objects port and detections_3d. Pick/place sees stable workspace objects rather than raw detector churn. |
| 5 | Rerun displays state | The annotated image uses current-frame identity assignments, while the 3D view receives trusted boxes plus current-frame colored pointcloud visualization. |
| 6 | Object leaves view | Frustum and camera-motion handling keep a hidden identity available for re-acquisition instead of immediately deleting or publishing stale visual blobs. |
| 7 | Object returns | WorldBelief can reuse the prior ID when geometry and appearance evidence are strong enough. If evidence is ambiguous, creating a new ID is safer than forcing a wrong merge. |
| 8 | Process restarts | Compact WorldBelief evidence is rehydrated from Memory2 history so the identity table does not always start from scratch. |
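
Steps 3 and 7 can be illustrated with a minimal sketch. It assumes a simple gated matcher: a geometric distance gate, a cosine-similarity gate on one appearance embedding, and a margin test that refuses to reuse an ID when two candidates are too close to call. The names `Track`, `Observation`, `associate` and all threshold values are hypothetical and not taken from the PR; the actual policy in identity_association.py may differ substantially.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Track:
    track_id: int
    position: tuple
    embedding: list


@dataclass
class Observation:
    position: tuple
    embedding: list


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def associate(tracks, observations, max_dist=0.15, min_sim=0.8, margin=0.05):
    """Return one track ID per observation, or None meaning 'create a new ID'."""
    proposals = []  # (similarity, observation index, track id)
    for oi, obs in enumerate(observations):
        # Keep only tracks that pass the geometric gate, best appearance first.
        scored = sorted(
            (
                (cosine(obs.embedding, tr.embedding), tr.track_id)
                for tr in tracks
                if math.dist(obs.position, tr.position) <= max_dist
            ),
            reverse=True,
        )
        scored = [c for c in scored if c[0] >= min_sim]
        if not scored:
            continue  # nothing plausible: a new ID will be created
        if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
            continue  # ambiguous: prefer a new ID over a wrong merge
        proposals.append((scored[0][0], oi, scored[0][1]))

    # Resolve conflicts one-to-one, strongest appearance match first.
    assigned = [None] * len(observations)
    used = set()
    for _sim, oi, tid in sorted(proposals, reverse=True):
        if tid not in used:
            assigned[oi] = tid
            used.add(tid)
    return assigned
```

In this sketch an observation that falls outside every gate, or that matches two tracks almost equally well, simply receives `None`, which mirrors the "new ID is safer than a wrong merge" choice described in step 7.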

Runtime object model

The PR deliberately separates four related concepts:

| Layer | Meaning | Used by |
| --- | --- | --- |
| Raw detections | YOLO-E masks/classes from the current image. | OSR object construction. |
| Frame objects | Current-frame 3D objects after WorldBelief assigns IDs. | Annotated image and audit. |
| Present objects | Objects with enough recent support to be trusted as present. | objects, detections_3d, pick/place, 3D boxes. |
| Maintained objects | Remembered identities kept internally for hidden re-acquisition/history. | WorldBelief lifecycle and Memory2 history. |

This split is why the Rerun pointcloud can stay visually clean while manipulation still receives stable present objects.
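
The difference between maintained and present objects can be sketched with a sliding support window. This is only an illustration under stated assumptions: the class `MaintainedObject`, the window length of 5 frames, and the `min_support` default of 3 are hypothetical and are not values taken from the PR.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MaintainedObject:
    object_id: int
    # Sliding record of the most recent frames: True where the object was observed.
    window: deque = field(default_factory=lambda: deque(maxlen=5))

    def update(self, observed: bool) -> None:
        """Record whether this identity was matched in the current frame."""
        self.window.append(observed)

    def is_present(self, min_support: int = 3) -> bool:
        """Trusted as present only with enough recent supporting frames."""
        return sum(self.window) >= min_support


def present_ids(maintained, min_support=3):
    """Filter the maintained set down to the manipulation-facing present set."""
    return [o.object_id for o in maintained if o.is_present(min_support)]
```

An object that has been seen once or twice stays maintained but is withheld from the present set, and an object that stops being observed drops out of the present set once its support falls below the threshold, while its identity remains available for re-acquisition.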

Why this design

  • WorldBelief is live task memory. Memory2 stores durable evidence, but the robot still needs a current materialized workspace state with deterministic identity policy.
  • Raw detector output is not enough for manipulation. Prompt-mode detection can flicker, split, or merge frame to frame. WorldBelief adds support windows, lifecycle state, and appearance-aware association before publishing objects to pick/place.
  • Memory2 remains the durable layer. WorldBelief writes object and vector evidence into Memory2 history and can rehydrate from it, but this PR does not claim to ship the final natural-language Memory2 query workflow.
  • Detector tracking is intentionally not the identity source. The xArm6 blueprint disables YOLO tracking so stable IDs come from the robot-side identity model instead of detector-local track state.
  • Visualization is separated from manipulation state. Current-frame pointcloud blobs are for visual clarity; present_objects and detections_3d remain the manipulation-facing outputs.

Main files

| Area | Files |
| --- | --- |
| Product blueprint | dimos/robot/manipulators/xarm/blueprints/worldbelief.py, dimos/robot/all_blueprints.py, dimos/manipulation/blueprints.py |
| OSR integration | dimos/perception/object_scene_registration.py |
| Identity engine | dimos/perception/detection/world_belief.py, identity_association.py, identity_features.py |
| Durable history | dimos/perception/detection/world_belief_history.py |
| Object representation | dimos/perception/detection/type/detection3d/object.py |
| Embeddings/detector support | dimos/models/embedding/dino.py, clip.py, mobileclip.py, dimos/perception/detection/detectors/yoloe.py |
| Recording | dimos/robot/manipulators/xarm/worldbelief_recorder.py, dimos/memory2/module.py |

Known limits / follow-ups

  • Exact same-location swaps between visually similar cans are still a hard identity case. If two physical objects exchange positions exactly, geometry alone cannot distinguish them, so a stricter appearance-mismatch policy (possibly with its own threshold) may be needed as a follow-up.
  • CLIP + DINO improve identity evidence but add latency. If annotated-image smoothness becomes the priority, the next step is a fast visual overlay before embedding completion while keeping trusted WorldBelief outputs after embeddings.
  • Runtime detector prompt updates currently replace the active prompt list. Append-to-default prompt workflow is a follow-up.
  • Memory2 vector evidence is recorded for future search/recall work, but this PR does not add the final agent query API.

@greptile-apps

greptile-apps Bot commented Jun 30, 2026


Greptile Summary

This PR adds the xArm6 WorldBelief perception stack.

  • New WorldBelief identity association and durable history storage.
  • OSR integration for embeddings, present objects, audit events, and annotated images.
  • xArm6 hardware blueprint wiring camera, detector, WorldBelief, Rerun, recorder, and pick/place.
  • DINO visual embeddings and updated stable Detection3D object output.
  • Dedicated Memory2 recorder for replay-critical perception streams.

Confidence Score: 4/5

The restart and history restore path can publish incorrect identity state until the rehydration lifecycle is fixed.

  • Restored objects keep old timestamps and an empty support window.
  • Normal restart gaps can prevent durable IDs from being reused.
  • Rehydrated-but-not-present objects can make new live objects wait behind the candidate gate.

dimos/perception/detection/world_belief_history.py, dimos/perception/detection/world_belief.py

Important Files Changed

| Filename | Overview |
| --- | --- |
| dimos/perception/detection/world_belief.py | Adds support-gated object identity, association, candidate gating, re-acquisition, audit state, and history restore wiring. |
| dimos/perception/detection/world_belief_history.py | Adds Memory2-backed object and vector evidence history plus compact state rehydration. |
| dimos/perception/object_scene_registration.py | Connects OSR to WorldBelief, embeddings, frustum handling, present-set publication, audit events, and visualization outputs. |
| dimos/perception/detection/identity_association.py | Adds typed association evidence and one-to-one frame assignment policy. |
| dimos/perception/detection/type/detection3d/object.py | Extends object metadata and publishes stable object IDs with stored fitted 3D geometry. |
| dimos/robot/manipulators/xarm/blueprints/worldbelief.py | Adds the xArm6 WorldBelief hardware blueprint and runtime configuration. |
| dimos/robot/manipulators/xarm/worldbelief_recorder.py | Adds a timestamped recorder for camera, detection, pointcloud, and audit streams. |
| dimos/models/embedding/dino.py | Adds a DINOv2 image embedding wrapper for visual identity evidence. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[RealSense RGB-D frames] --> B[YOLO-E detections]
    B --> C[ObjectSceneRegistration]
    A --> C
    C --> D[3D observations and embeddings]
    D --> E[WorldBelief]
    E --> F[Present objects]
    F --> G[Pick and place]
    F --> H[Rerun and Detection3DArray]
    E --> I[WorldBelief history DB]
    I --> J[Rehydrate on restart]
    J --> E
    C --> K[Annotated image and pointcloud]
    K --> H
    A --> L[WorldBelief recorder]
    C --> L
    E --> L

Reviews (1): Last reviewed commit: "Initial xArm6 WorldBelief perception st..."

visual_embedding_model=a["visual_embedding_model"],
visual_embedding_device=a["visual_embedding_device"],
visual_embedding_dim=a["visual_embedding_dim"],
)


P1 Rehydrated IDs Age Out

When the process restarts more than the configured reacquisition window after the last history write, restored entities get window=[] and keep their old last_seen timestamps. The first live frame advances WorldBelief to the current camera time, so restored IDs are not present and are rejected by the reacquisition recency check; the next observation creates a new identity instead of reusing the durable one.
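
The failure mode can be reproduced in a small sketch. The `Entity` class, the `restored` flag, and both functions below are hypothetical illustrations rather than code from the PR; only the `last_seen` and `window` field names come from the comment above. The second function shows one possible mitigation (exempting restored entities from the recency check until they have been seen live once), which is an assumption about how this could be addressed, not the author's fix.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_id: int
    last_seen: float  # timestamp of the last supporting observation, in seconds
    window: list = field(default_factory=list)  # empty right after rehydration
    restored: bool = False  # hypothetical flag set while loading from history


def can_reacquire(entity, now, reacquire_window_s):
    """Plain recency check: stale timestamps make restored IDs ineligible."""
    return (now - entity.last_seen) <= reacquire_window_s


def can_reacquire_restart_aware(entity, now, reacquire_window_s):
    """Variant that lets a restored entity be matched before its first live sighting."""
    if entity.restored and not entity.window:
        return True
    return can_reacquire(entity, now, reacquire_window_s)
```

With a restart gap longer than the window, the plain check rejects the restored entity, while the restart-aware variant keeps it eligible until a live observation populates its window and the normal recency rule takes over.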

with self._lock:
    now = self._frame_time(objects, frame_ts)
    self._now = now
    scene_established_at_frame_start = len(self._entities) >= max(3, self._min_support)


P2 History Enables Candidate Gate

After rehydration, len(self._entities) can already satisfy this scene-established check even though none of those restored entities are present. A real object that does not match stale history is then forced through _update_new_candidate() and withheld until it has enough repeated support, so the first valid objects after a restart can be missing from the manipulation-facing present set.
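
A minimal sketch of the difference, assuming entities are represented as plain dictionaries with a hypothetical `present` key: the first function mirrors the quoted `len(self._entities)` check, and the second counts only entities that are currently present. The second function is one possible direction, not the PR's implementation.

```python
def scene_established_by_count(entities, min_support):
    """Mirrors the quoted check: every maintained entity counts, present or not."""
    return len(entities) >= max(3, min_support)


def scene_established_by_presence(entities, min_support):
    """Hypothetical alternative: only currently present entities establish the scene."""
    present = [e for e in entities if e["present"]]
    return len(present) >= max(3, min_support)


# Four identities restored from history, none of them observed yet in this run.
restored = [{"id": i, "present": False} for i in range(4)]
print(scene_established_by_count(restored, min_support=3))
print(scene_established_by_presence(restored, min_support=3))
```

Under the count-based check the candidate gate is already active on the first frame after a restart; under the presence-based check it stays open until live objects have actually been confirmed.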
