feat: add coordinate space abstraction for open weights LLM support by philipph-askui · Pull Request #282 · askui/python-sdk

philipph-askui · 2026-06-09T14:38:30Z

Introduces a VlmCoordinateSpace strategy (pixel, scaled, normalized) so that agentOS facades can map model-emitted coordinates to screen pixels.
Also adds auto-detection for Qwen, Holo (0-1000 grid) and Kimi (0.0-1.0 floats) in OllamaVlmProvider.
Appends coordinate info to system prompts for OpenAI-compatible providers.

Replaces the fixed SCREENSHOT_RESOLUTION constant with per-provider image scalers. Each VlmProvider now owns an ImageScaler callable and exposes max_image_edge (also configurable via the ASKUI_VLM_MAX_IMAGE_EDGE env var). Facades derive target_resolution dynamically from the scaler output.

programminx-askui

Hello,

I've checked the PR. Overall it looks nice. We should introduce a ClickableCapability/ClickableTarget.

programminx-askui · 2026-06-17T18:55:30Z

+        image_scaler: ImageScaler | None = None,
+        max_image_edge: int | None = None,


Everything image-related should start with image_, similar to model_. Otherwise it's hard to see how things relate to each other.

Suggested change

-        image_scaler: ImageScaler | None = None,
-        max_image_edge: int | None = None,
+        image_scaler: ImageScaler | None = None,
+        image_edge_max: int | None = None,

But isn't this a property of the image scaler?

Yes. This value is only used when users do not specify an ImageScaler explicitly and the default one is used.

programminx-askui · 2026-06-17T18:57:31Z

+        if self._image_scaler_override is not None:
+            return self._image_scaler_override
+        max_edge = self._max_edge
+        return lambda image: compute_patch_optimized_image(image, max_edge=max_edge)


Are we sure that we can use the lambda here? I've had bad experiences with lambdas in Python, where they cache the computation, but I can't remember the details.

I refactored this so that we now have a proper ImageScaler base class, with subclasses used here instead of lambdas.
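For context on the lambda concern: a Python lambda does not cache its result, but a closure looks up its free variables when it is called, not when it is defined (late binding). That may be the behaviour being remembered above; this is a guess, since the comment does not say. Copying the value into a local first, as the diff does with max_edge, pins it. A minimal sketch with hypothetical names (Provider and both methods are made up for illustration):

```python
# Hypothetical stand-in for a provider; not the SDK's actual class.
class Provider:
    def __init__(self, max_edge: int) -> None:
        self._max_edge = max_edge

    def late_bound_scaler(self):
        # Reads self._max_edge at call time, so later changes are visible.
        return lambda edge: min(edge, self._max_edge)

    def pinned_scaler(self):
        # Copies the value into a local first, as the diff above does.
        max_edge = self._max_edge
        return lambda edge: min(edge, max_edge)


provider = Provider(max_edge=1000)
late = provider.late_bound_scaler()
pinned = provider.pinned_scaler()

provider._max_edge = 500  # change the setting after creating both callables

late_result = late(2000)      # 500: sees the updated attribute
pinned_result = pinned(2000)  # 1000: still uses the value captured earlier
```

Either behaviour can be what you want; the point is only that the two differ once the attribute changes after the callable was created.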

programminx-askui · 2026-06-17T18:59:42Z

+        self._image_scaler_override = image_scaler
+        self._max_edge = (
+            max_image_edge
+            or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
+            or _DEFAULT_MAX_IMAGE_EDGE
+        )

    @property
    @override
    def model_id(self) -> str:
        return self._model_id_value

+    @property
+    @override
+    def image_scaler(self) -> ImageScaler:
+        if self._image_scaler_override is not None:
+            return self._image_scaler_override
+        max_edge = self._max_edge
+        return lambda image: compute_patch_optimized_image(image, max_edge=max_edge)


Why so complicated?

Suggested change

-        self._image_scaler_override = image_scaler
-        self._max_edge = (
-            max_image_edge
-            or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
-            or _DEFAULT_MAX_IMAGE_EDGE
-        )
-
-    @property
-    @override
-    def model_id(self) -> str:
-        return self._model_id_value
-
-    @property
-    @override
-    def image_scaler(self) -> ImageScaler:
-        if self._image_scaler_override is not None:
-            return self._image_scaler_override
-        max_edge = self._max_edge
-        return lambda image: compute_patch_optimized_image(image, max_edge=max_edge)
+        self._image_scaler = image_scaler if image_scaler else lambda image: compute_patch_optimized_image(image, max_edge=max_edge)
+        self._max_edge = (
+            max_image_edge
+            or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
+            or _DEFAULT_MAX_IMAGE_EDGE
+        )
+
+    @property
+    @override
+    def model_id(self) -> str:
+        return self._model_id_value

I refactored it to use a proper ImageScaler class instead of lambdas.
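A rough sketch of what a class-based scaler could look like. The refactored code is not shown in this thread, so everything here is an assumption for illustration: the class names, the size-only interface (the real ImageScaler works on images) and the default edge of 1024.

```python
from abc import ABC, abstractmethod
from typing import Optional, Tuple

Size = Tuple[int, int]


class SizeScaler(ABC):
    """Hypothetical base class: maps an input size to a target size."""

    @abstractmethod
    def target_size(self, size: Size) -> Size: ...


class MaxEdgeScaler(SizeScaler):
    """Shrinks a size so its longest edge is at most max_edge; never upscales."""

    def __init__(self, max_edge: int) -> None:
        if max_edge <= 0:
            raise ValueError("max_edge must be positive")
        self.max_edge = max_edge

    def target_size(self, size: Size) -> Size:
        width, height = size
        factor = min(1.0, self.max_edge / max(width, height))
        return max(1, round(width * factor)), max(1, round(height * factor))


class Provider:
    """Hypothetical provider: uses an explicit scaler, else a default one."""

    def __init__(
        self,
        image_scaler: Optional[SizeScaler] = None,
        max_image_edge: int = 1024,  # placeholder default
    ) -> None:
        if image_scaler is None:
            image_scaler = MaxEdgeScaler(max_image_edge)
        self.image_scaler = image_scaler
```

With this shape, max_image_edge is only consulted when no scaler is passed in, which matches the explanation given earlier in the review.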

programminx-askui · 2026-06-17T19:00:36Z

+        self._max_edge = (
+            max_image_edge
+            or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
+            or _DEFAULT_MAX_IMAGE_EDGE
+        )


I would remove it if possible

You mean the env variable?
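For reference, the precedence that the `or` chain in the diff implements can be sketched in isolation. The helper name resolve_max_edge, the injectable env mapping and the default value of 1024 are assumptions for illustration; the real default is not shown in the diff.

```python
import os
from typing import Mapping, Optional

_DEFAULT_MAX_IMAGE_EDGE = 1024  # placeholder; the real default is not shown in the diff


def resolve_max_edge(
    max_image_edge: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Mirror the `or` chain from the diff: argument, then env var, then default."""
    return (
        max_image_edge
        or int(env.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
        or _DEFAULT_MAX_IMAGE_EDGE
    )


explicit = resolve_max_edge(2000, {"ASKUI_VLM_MAX_IMAGE_EDGE": "1568"})
from_env = resolve_max_edge(None, {"ASKUI_VLM_MAX_IMAGE_EDGE": "1568"})
fallback = resolve_max_edge(None, {})
zero_is_unset = resolve_max_edge(0, {})  # 0 is falsy, so it falls through
```

Two side effects of this pattern are worth knowing: an explicit 0 is treated as "not set", and a non-integer value in the env var raises ValueError from int().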

programminx-askui · 2026-06-17T19:35:10Z

+        self._scaler.real_screen_resolution = self._agent_os.screenshot(
+            report=False
        ).size


The screen resolution can change over time.

This value is updated with every new screenshot.

programminx-askui · 2026-06-17T19:38:28Z

+        self._scaler = CoordinateScaler(
+            coordinate_space=coordinate_space,
+            image_scaler=image_scaler,
+            fetch_real_resolution=lambda: self._agent_os.screenshot(report=False).size,


I'm not a fan of lambdas in Python. What is the reason for using them?

Removed the lambdas.

programminx-askui · 2026-06-18T07:09:35Z

        self._agent_os: AndroidAgentOs = agent_os
-        self._target_resolution: Tuple[int, int] = (1024, 768)
-        self._real_screen_resolution: Optional[Tuple[int, int]] = None
+        self._scaler = CoordinateScaler(


This is a note.

The "CoordinateScaler" should live on the DisplayCapability/DisplayTarget/WindowTarget/WindowCapability.

But here we assume that we have only one ClickableTarget.

At the moment it does not matter which target we are scaling for. The LLM sees a screenshot and predicts a position on it. All we are doing is scaling the predicted coordinates to the correct absolute pixel coordinates of the screenshot we got.
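The scaling described here can be sketched as a single function. This is an illustration under assumptions, not the PR's CoordinateScaler: the function name, the space_extent parameter and the clamping at the far edge are made up for the example.

```python
from typing import Tuple


def to_screen_pixels(
    x: float,
    y: float,
    space_extent: Tuple[float, float],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a model-emitted point to absolute pixels of the real screenshot.

    space_extent is the size of the model's coordinate space:
    (1000, 1000) for a 0-1000 grid, (1.0, 1.0) for normalized floats,
    or the size of the scaled image the model saw for pixel coordinates.
    """
    extent_x, extent_y = space_extent
    width, height = screen_size
    if not (0 <= x <= extent_x and 0 <= y <= extent_y):
        raise ValueError(f"point ({x}, {y}) is outside the coordinate space")
    # Clamp so the far edge of the space still maps to a valid pixel index.
    px = min(width - 1, int(x / extent_x * width))
    py = min(height - 1, int(y / extent_y * height))
    return px, py


screen = (1920, 1080)
from_grid = to_screen_pixels(500, 500, (1000, 1000), screen)
from_normalized = to_screen_pixels(0.5, 0.5, (1.0, 1.0), screen)
from_scaled_image = to_screen_pixels(512, 384, (1024, 768), (2048, 1536))
```

The target type does not appear anywhere in the function: only the coordinate space the model used and the size of the real screenshot matter.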

programminx-askui · 2026-06-18T07:16:48Z

+        self.target_resolution = scaled.size
+        return scaled
+
+    def scale_coordinates(


Do we need this function? Shouldn't this be replaced by the ScaleCoordinate layer?

This is the ScaleCoordinate layer :-D Or what do you mean?

programminx-askui · 2026-06-18T07:18:13Z

+    # Binary search for largest scale that fits within token budget
+    lo, hi = 0.0, scale
+    for _ in range(50):
+        mid = (lo + hi) / 2
+        w = max(1, int(width * mid))
+        h = max(1, int(height * mid))
+        if count_image_tokens(w, h, patch_size) <= max_tokens:
+            lo = mid
+        else:
+            hi = mid


Do we need this? Can we move it into its own function?

The code was copied directly from an Anthropic example, so I would keep it here as is.
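For illustration, this is roughly what the loop would look like if it were extracted into its own function. The function name largest_fitting_scale is hypothetical, and count_image_tokens is a stand-in that assumes one token per patch_size x patch_size patch with partial patches rounded up; the PR's real token counter may differ.

```python
import math


def count_image_tokens(width: int, height: int, patch_size: int) -> int:
    # Assumption: one token per patch, with partial patches rounded up.
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)


def largest_fitting_scale(
    width: int, height: int, patch_size: int, max_tokens: int, scale: float = 1.0
) -> float:
    """Binary-search the largest scale in [0, scale) whose image fits max_tokens."""
    lo, hi = 0.0, scale
    for _ in range(50):
        mid = (lo + hi) / 2
        w = max(1, int(width * mid))
        h = max(1, int(height * mid))
        if count_image_tokens(w, h, patch_size) <= max_tokens:
            lo = mid  # fits: try a larger scale
        else:
            hi = mid  # too many tokens: try a smaller scale
    return lo


best = largest_fitting_scale(1920, 1080, patch_size=28, max_tokens=1000)
fitted = (max(1, int(1920 * best)), max(1, int(1080 * best)))
```

The search relies on the token count never decreasing as the scale grows, which holds for the patch-based counter assumed here.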

programminx-askui · 2026-06-18T07:21:47Z

+    PixelCoordinateSpace,
+    ScaledCoordinateSpace,
+)
+from askui.tools.android.agent_os_facade import AndroidAgentOsFacade


Did you check the tests? I would add some dynamic tests here, with different resolutions, as well as negative tests.

philipph-askui added 7 commits June 9, 2026 16:37

feat: add coordinate space abstraction for open weights LLM support

8f496af

fix: map non-pixel coordinate spaces directly to device resolution

31865a8

refactor: clean up PR (composition, deduplication, exports)

3665cc4

chore: fine-tune settings for MAX_IMAGE_EDGE

cca155f

Merge remote-tracking branch 'origin/main' into feat/llm_coordinate_s…

184ac9c

…ystem

fix: outdated cos test for kimi

d6415eb

philipph-askui marked this pull request as ready for review June 12, 2026 13:39

philipph-askui requested review from mlikasam-askui and programminx-askui June 12, 2026 13:39

programminx-askui reviewed Jun 18, 2026

philipph-askui added 3 commits June 18, 2026 14:22

feat: integrate features from PR #283 (tool result image scaling)

fabd05d

address review remarks

d3bde3f

chore: rename coordinateScaler file

26cf7c7

philipph-askui mentioned this pull request Jun 18, 2026

Feat/tool result image scaling #283

Closed
