Skip to content

feat: add coordinate space abstraction for open weights LLM support#282

Open
philipph-askui wants to merge 10 commits into
mainfrom
feat/llm_coordinate_system
Open

feat: add coordinate space abstraction for open weights LLM support#282
philipph-askui wants to merge 10 commits into
mainfrom
feat/llm_coordinate_system

Conversation

@philipph-askui

Copy link
Copy Markdown
Contributor

Introduces VlmCoordinateSpace strategy (pixel, scaled, normalized) so agentOS facades can map model-emitted coordinates to screen pixels.
Thereby adds auto-detection for Qwen, Holo (0-1000 grid) and Kimi (0.0-1.0 floats) in OllamaVlmProvider.
Appends coordinate info to system prompts for OpenAI-compatible providers.

@philipph-askui philipph-askui marked this pull request as ready for review June 12, 2026 13:39

@programminx-askui programminx-askui left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello,

I've check the PR. Overall nice. We should introduce a ClickableCapability/ClickableTarget.

Comment on lines +81 to +82
image_scaler: ImageScaler | None = None,
max_image_edge: int | None = None,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything what is images related should start with image_. Similare like model_. Otherwise it's hard to understand the zusammenhänge

Suggested change
image_scaler: ImageScaler | None = None,
max_image_edge: int | None = None,
image_scaler: ImageScaler | None = None,
image_edge_max: int | None = None,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but is this not a property of the image scaler?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. This value is only used if users do not specify an ImageScaler explicitly but the default one is used.

if self._image_scaler_override is not None:
return self._image_scaler_override
max_edge = self._max_edge
return lambda image: compute_patch_optimized_image(image, max_edge=max_edge)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure, that we can use the lambda? I had bad experience with lambda in python, that they caach the computation. But can't remember

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored this to that we now have a proper ImageScaler base class with subclasses that are used here instead.

Comment on lines +74 to +92
self._image_scaler_override = image_scaler
self._max_edge = (
max_image_edge
or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
or _DEFAULT_MAX_IMAGE_EDGE
)

@property
@override
def model_id(self) -> str:
return self._model_id_value

@property
@override
def image_scaler(self) -> ImageScaler:
if self._image_scaler_override is not None:
return self._image_scaler_override
max_edge = self._max_edge
return lambda image: compute_patch_optimized_image(image, max_edge=max_edge)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why so complicated?

Suggested change
self._image_scaler_override = image_scaler
self._max_edge = (
max_image_edge
or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
or _DEFAULT_MAX_IMAGE_EDGE
)
@property
@override
def model_id(self) -> str:
return self._model_id_value
@property
@override
def image_scaler(self) -> ImageScaler:
if self._image_scaler_override is not None:
return self._image_scaler_override
max_edge = self._max_edge
return lambda image: compute_patch_optimized_image(image, max_edge=max_edge)
self._image_scaler = image_scaler if image_scaler else lambda image: compute_patch_optimized_image(image, max_edge=max_edge)
self._max_edge = (
max_image_edge
or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
or _DEFAULT_MAX_IMAGE_EDGE
)
@property
@override
def model_id(self) -> str:
return self._model_id_value

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored it to use a proper imagescaler class i/o lambdas

Comment on lines +75 to +79
self._max_edge = (
max_image_edge
or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
or _DEFAULT_MAX_IMAGE_EDGE
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove it if possible

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean the env variable?

Comment on lines +44 to 46
self._scaler.real_screen_resolution = self._agent_os.screenshot(
report=False
).size

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The screen resolution can change over time

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this value will be update with every new screenshot

self._scaler = CoordinateScaler(
coordinate_space=coordinate_space,
image_scaler=image_scaler,
fetch_real_resolution=lambda: self._agent_os.screenshot(report=False).size,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of lambda in python. What is the reason to use them?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the lambdas

self._agent_os: AndroidAgentOs = agent_os
self._target_resolution: Tuple[int, int] = (1024, 768)
self._real_screen_resolution: Optional[Tuple[int, int]] = None
self._scaler = CoordinateScaler(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a note.

The "CoordinateScaler" should be on the DisplayCapability/DisplayTarget/WindowTarget/WindowCapability.

But here we assume, that we have only one ClickableTarget.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the moment it does not matter after all what target we are scaling for. The llm sees a screenshot and predicts a position on it. All of what we are doing is scaling the predicted coordinates to the correct absolute image coordinates of the screenshot we got

self.target_resolution = scaled.size
return scaled

def scale_coordinates(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this function? Should this not replaced from the ScaleCoordinate Layer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the scaleCoordinate Layer :-D or what do you mean?

Comment on lines +71 to +80
# Binary search for largest scale that fits within token budget
lo, hi = 0.0, scale
for _ in range(50):
mid = (lo + hi) / 2
w = max(1, int(width * mid))
h = max(1, int(height * mid))
if count_image_tokens(w, h, patch_size) <= max_tokens:
lo = mid
else:
hi = mid

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this? can we move this to a own function?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code was copied directly from an example from anthropic, so I would keep it here as is

PixelCoordinateSpace,
ScaledCoordinateSpace,
)
from askui.tools.android.agent_os_facade import AndroidAgentOsFacade

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you checked the tests? I would add here some dynamic tests. with different resolutions and negative tests

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants