feat: add coordinate space abstraction for open weights LLM support#282
feat: add coordinate space abstraction for open weights LLM support#282philipph-askui wants to merge 10 commits into
Conversation
Replace fixed SCREENSHOT_RESOLUTION constant with per-provider image scalers. Each VlmProvider now owns an ImageScaler callable and exposes max_image_edge (also via ASKUI_VLM_MAX_IMAGE_EDGE env var). Facades derive target_resolution dynamically from scaler output.
programminx-askui
left a comment
There was a problem hiding this comment.
Hello,
I've check the PR. Overall nice. We should introduce a ClickableCapability/ClickableTarget.
| image_scaler: ImageScaler | None = None, | ||
| max_image_edge: int | None = None, |
There was a problem hiding this comment.
Everything what is images related should start with image_. Similare like model_. Otherwise it's hard to understand the zusammenhänge
| image_scaler: ImageScaler | None = None, | |
| max_image_edge: int | None = None, | |
| image_scaler: ImageScaler | None = None, | |
| image_edge_max: int | None = None, |
There was a problem hiding this comment.
but is this not a property of the image scaler?
There was a problem hiding this comment.
yes. This value is only used if users do not specify an ImageScaler explicitly but the default one is used.
| if self._image_scaler_override is not None: | ||
| return self._image_scaler_override | ||
| max_edge = self._max_edge | ||
| return lambda image: compute_patch_optimized_image(image, max_edge=max_edge) |
There was a problem hiding this comment.
are we sure, that we can use the lambda? I had bad experience with lambda in python, that they caach the computation. But can't remember
There was a problem hiding this comment.
I refactored this to that we now have a proper ImageScaler base class with subclasses that are used here instead.
| self._image_scaler_override = image_scaler | ||
| self._max_edge = ( | ||
| max_image_edge | ||
| or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0")) | ||
| or _DEFAULT_MAX_IMAGE_EDGE | ||
| ) | ||
|
|
||
| @property | ||
| @override | ||
| def model_id(self) -> str: | ||
| return self._model_id_value | ||
|
|
||
| @property | ||
| @override | ||
| def image_scaler(self) -> ImageScaler: | ||
| if self._image_scaler_override is not None: | ||
| return self._image_scaler_override | ||
| max_edge = self._max_edge | ||
| return lambda image: compute_patch_optimized_image(image, max_edge=max_edge) |
There was a problem hiding this comment.
Why so complicated?
| self._image_scaler_override = image_scaler | |
| self._max_edge = ( | |
| max_image_edge | |
| or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0")) | |
| or _DEFAULT_MAX_IMAGE_EDGE | |
| ) | |
| @property | |
| @override | |
| def model_id(self) -> str: | |
| return self._model_id_value | |
| @property | |
| @override | |
| def image_scaler(self) -> ImageScaler: | |
| if self._image_scaler_override is not None: | |
| return self._image_scaler_override | |
| max_edge = self._max_edge | |
| return lambda image: compute_patch_optimized_image(image, max_edge=max_edge) | |
| self._image_scaler = image_scaler if image_scaler else lambda image: compute_patch_optimized_image(image, max_edge=max_edge) | |
| self._max_edge = ( | |
| max_image_edge | |
| or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0")) | |
| or _DEFAULT_MAX_IMAGE_EDGE | |
| ) | |
| @property | |
| @override | |
| def model_id(self) -> str: | |
| return self._model_id_value | |
There was a problem hiding this comment.
I refactored it to use a proper imagescaler class i/o lambdas
| self._max_edge = ( | ||
| max_image_edge | ||
| or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0")) | ||
| or _DEFAULT_MAX_IMAGE_EDGE | ||
| ) |
There was a problem hiding this comment.
I would remove it if possible
There was a problem hiding this comment.
you mean the env variable?
| self._scaler.real_screen_resolution = self._agent_os.screenshot( | ||
| report=False | ||
| ).size |
There was a problem hiding this comment.
The screen resolution can change over time
There was a problem hiding this comment.
this value will be update with every new screenshot
| self._scaler = CoordinateScaler( | ||
| coordinate_space=coordinate_space, | ||
| image_scaler=image_scaler, | ||
| fetch_real_resolution=lambda: self._agent_os.screenshot(report=False).size, |
There was a problem hiding this comment.
I'm not a fan of lambda in python. What is the reason to use them?
There was a problem hiding this comment.
removed the lambdas
| self._agent_os: AndroidAgentOs = agent_os | ||
| self._target_resolution: Tuple[int, int] = (1024, 768) | ||
| self._real_screen_resolution: Optional[Tuple[int, int]] = None | ||
| self._scaler = CoordinateScaler( |
There was a problem hiding this comment.
This is a note.
The "CoordinateScaler" should be on the DisplayCapability/DisplayTarget/WindowTarget/WindowCapability.
But here we assume, that we have only one ClickableTarget.
There was a problem hiding this comment.
at the moment it does not matter after all what target we are scaling for. The llm sees a screenshot and predicts a position on it. All of what we are doing is scaling the predicted coordinates to the correct absolute image coordinates of the screenshot we got
| self.target_resolution = scaled.size | ||
| return scaled | ||
|
|
||
| def scale_coordinates( |
There was a problem hiding this comment.
Do we need this function? Should this not replaced from the ScaleCoordinate Layer?
There was a problem hiding this comment.
this is the scaleCoordinate Layer :-D or what do you mean?
| # Binary search for largest scale that fits within token budget | ||
| lo, hi = 0.0, scale | ||
| for _ in range(50): | ||
| mid = (lo + hi) / 2 | ||
| w = max(1, int(width * mid)) | ||
| h = max(1, int(height * mid)) | ||
| if count_image_tokens(w, h, patch_size) <= max_tokens: | ||
| lo = mid | ||
| else: | ||
| hi = mid |
There was a problem hiding this comment.
do we need this? can we move this to a own function?
There was a problem hiding this comment.
code was copied directly from an example from anthropic, so I would keep it here as is
| PixelCoordinateSpace, | ||
| ScaledCoordinateSpace, | ||
| ) | ||
| from askui.tools.android.agent_os_facade import AndroidAgentOsFacade |
There was a problem hiding this comment.
Did you checked the tests? I would add here some dynamic tests. with different resolutions and negative tests
Introduces VlmCoordinateSpace strategy (pixel, scaled, normalized) so agentOS facades can map model-emitted coordinates to screen pixels.
Thereby adds auto-detection for Qwen, Holo (0-1000 grid) and Kimi (0.0-1.0 floats) in OllamaVlmProvider.
Appends coordinate info to system prompts for OpenAI-compatible providers.