Add DeepSeek model integration + Fix Linux/Wayland screenshots by papadie23 · Pull Request #270 · OthersideAI/self-operating-computer

papadie23 · 2026-06-28T10:36:17Z

Summary

Adds DeepSeek as a new model provider and fixes several cross-platform issues.

DeepSeek integration (`operate -m deepseek-with-ocr`)

DeepSeek API uses OpenAI-compatible client (https://api.deepseek.com)
Since DeepSeek lacks vision support, screen text is extracted via Tesseract OCR and sent as structured text
Uses deepseek-v4-pro by default (configurable via DEEPSEEK_MODEL_NAME)
Shows model reasoning/thinking tokens in terminal
Custom text-only system prompt for keyboard-first interaction

Screenshot fix for Linux/Wayland

PIL ImageGrab is broken on Wayland — replaced with multiple fallbacks:
- flameshot (primary, works on both X11 and Wayland)
- gnome-screenshot, mss, ImageGrab as fallbacks

OCR improvements

Added fuzzy text matching (handles OCR errors like "Gooale" ≈ "Google")
get_text_element() now returns None instead of crashing on miss
All OCR-mode functions updated to gracefully skip unresolvable clicks
Global EasyOCR reader caching (no more model re-download per loop)

Behavior fixes

Strips premature "done" operations — model must verify success before claiming it
Smarter delays: 4s after navigation/enter, 2s base
Empty-screen detection gives context-aware guidance

Dependencies

Bumped requirements.txt pins from == to >= for Python 3.13 compatibility
Fixed numpy==1.26.1 (yanked) → >=1.26.2

Files changed (7 files, no new files)

File	Change
`operate/config.py`	+30 lines (deepseek_init, validation)
`operate/models/apis.py`	+260 lines (call_deepseek_with_ocr, fixes)
`operate/models/prompts.py`	+65 lines (text-only system prompt)
`operate/operate.py`	+27 lines (done-stripping)
`operate/utils/ocr.py`	+85/-100 (fuzzy matching, None return)
`operate/utils/screenshot.py`	+35 lines (flameshot fallback chain)
`requirements.txt`	dep updates (== → >=, Python 3.13)

- Add model mode using text-only OCR approach (DeepSeek API doesn't support vision, so screen text is extracted via Tesseract/EasyOCR and sent as structured text) - Add config with OpenAI-compatible client - Add for text-only model guidance - Show DeepSeek reasoning tokens in terminal for transparency Fixes: - Replace broken X11 screenshot with flameshot (works on Wayland) with fallbacks to gnome-screenshot, mss, then ImageGrab - Add fuzzy text matching in OCR (diffs can now match 'Gooale' ~ 'Google') - Return None instead of raising on text-not-found to avoid crashes - Cache EasyOCR reader globally to avoid re-downloading models each loop - Strip premature 'done' operations (model must verify before claiming success) - Smarter delays: 4s after enter/navigation, 2s base - Update requirements.txt pins to >= for Python 3.13 compatibility - Fix numpy 1.26.1 -> 1.26.2 (yanked)

papadie23 · 2026-06-28T10:36:25Z

Tested on:

Ubuntu 26.04 LTS (resolute), kernel 7.0.0-22
GNOME on Wayland (XDG_SESSION_TYPE=wayland)
Python 3.13 with conda
Screenshot working via flameshot (X11 tools broken on Wayland)

papadie23 · 2026-06-28T10:38:28Z

Testing performed

Tested operate -m deepseek-with-ocr on Ubuntu 26.04 Wayland
Successfully launched browsers, navigated to URLs via keyboard shortcuts
OCR text extraction works (Tesseract finds 150-250 screen text elements)
DeepSeek reasoning mode outputs visible chain-of-thought
Screenshot capture works across all fallback methods

Known limitations / Future improvements

OCR lacks visual context: Tesseract extracts raw text strings but cannot understand UI layout, window focus, or whether a page has fully loaded. The model sometimes types into the wrong window because it does not know which window is focused.
Vision model would be ideal: A future enhancement could pair this with a local vision model (e.g. Ollama + llava) to describe the screen visually, then feed that description to DeepSeek for decision-making. This would give the model awareness of window focus, loading states, and spatial layout.
EasyOCR click mapping: Click coordinate resolution via OCR text-to-coordinate mapping is still fragile — fuzzy matching helps but is not perfect.
Multi-monitor: The current flameshot capture works on primary display. Multi-monitor setups may need display selection.

I plan to keep iterating on these in follow-up PRs.

papadie23 · 2026-06-28T10:41:13Z

Note: DeepSeek currently offers file/image upload in their web chat interface (chat.deepseek.com). While their API does not yet expose multimodal/vision endpoints, this suggests vision support may be added to the API in the future. When that happens, the deepseek-with-ocr mode could be upgraded to send screenshots directly as images.

papadie23 · 2026-06-28T10:45:54Z

The README could be updated to add DeepSeek under supported models, e.g.:

#### Try DeepSeek `-m deepseek-with-ocr`
operate -m deepseek-with-ocr

Similar to how Claude, Qwen, and LLaVA are listed. Happy to include that in this PR if maintainers want.

Fix coordinates crash, enable mouse clicks, improve focus prompt

0a4e26e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add DeepSeek model integration + Fix Linux/Wayland screenshots#270

Add DeepSeek model integration + Fix Linux/Wayland screenshots#270
papadie23 wants to merge 2 commits into
OthersideAI:mainfrom
papadie23:add-deepseek-integration

papadie23 commented Jun 28, 2026

Uh oh!

papadie23 commented Jun 28, 2026

Uh oh!

papadie23 commented Jun 28, 2026

Uh oh!

papadie23 commented Jun 28, 2026 •

edited

Loading

Uh oh!

papadie23 commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

papadie23 commented Jun 28, 2026

Summary

DeepSeek integration (operate -m deepseek-with-ocr)

Screenshot fix for Linux/Wayland

OCR improvements

Behavior fixes

Dependencies

Files changed (7 files, no new files)

Uh oh!

papadie23 commented Jun 28, 2026

Uh oh!

papadie23 commented Jun 28, 2026

Testing performed

Known limitations / Future improvements

Uh oh!

papadie23 commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

papadie23 commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

DeepSeek integration (`operate -m deepseek-with-ocr`)

papadie23 commented Jun 28, 2026 •

edited

Loading