Skip to content

Add DeepSeek model integration + Fix Linux/Wayland screenshots#270

Open
papadie23 wants to merge 2 commits into
OthersideAI:mainfrom
papadie23:add-deepseek-integration
Open

Add DeepSeek model integration + Fix Linux/Wayland screenshots#270
papadie23 wants to merge 2 commits into
OthersideAI:mainfrom
papadie23:add-deepseek-integration

Conversation

@papadie23

Copy link
Copy Markdown

Summary

Adds DeepSeek as a new model provider and fixes several cross-platform issues.

DeepSeek integration (operate -m deepseek-with-ocr)

  • DeepSeek API uses OpenAI-compatible client (https://api.deepseek.com)
  • Since DeepSeek lacks vision support, screen text is extracted via Tesseract OCR and sent as structured text
  • Uses deepseek-v4-pro by default (configurable via DEEPSEEK_MODEL_NAME)
  • Shows model reasoning/thinking tokens in terminal
  • Custom text-only system prompt for keyboard-first interaction

Screenshot fix for Linux/Wayland

  • PIL ImageGrab is broken on Wayland — replaced with multiple fallbacks:
    • flameshot (primary, works on both X11 and Wayland)
    • gnome-screenshot, mss, ImageGrab as fallbacks

OCR improvements

  • Added fuzzy text matching (handles OCR errors like "Gooale" ≈ "Google")
  • get_text_element() now returns None instead of crashing on miss
  • All OCR-mode functions updated to gracefully skip unresolvable clicks
  • Global EasyOCR reader caching (no more model re-download per loop)

Behavior fixes

  • Strips premature "done" operations — model must verify success before claiming it
  • Smarter delays: 4s after navigation/enter, 2s base
  • Empty-screen detection gives context-aware guidance

Dependencies

  • Bumped requirements.txt pins from == to >= for Python 3.13 compatibility
  • Fixed numpy==1.26.1 (yanked) → >=1.26.2

Files changed (7 files, no new files)

File Change
operate/config.py +30 lines (deepseek_init, validation)
operate/models/apis.py +260 lines (call_deepseek_with_ocr, fixes)
operate/models/prompts.py +65 lines (text-only system prompt)
operate/operate.py +27 lines (done-stripping)
operate/utils/ocr.py +85/-100 (fuzzy matching, None return)
operate/utils/screenshot.py +35 lines (flameshot fallback chain)
requirements.txt dep updates (== → >=, Python 3.13)

- Add  model mode using text-only OCR approach
  (DeepSeek API doesn't support vision, so screen text is extracted
  via Tesseract/EasyOCR and sent as structured text)
- Add  config with OpenAI-compatible client
- Add  for text-only model guidance
- Show DeepSeek reasoning tokens in terminal for transparency

Fixes:
- Replace broken X11 screenshot with flameshot (works on Wayland)
  with fallbacks to gnome-screenshot, mss, then ImageGrab
- Add fuzzy text matching in OCR (diffs can now match 'Gooale' ~ 'Google')
- Return None instead of raising on text-not-found to avoid crashes
- Cache EasyOCR reader globally to avoid re-downloading models each loop
- Strip premature 'done' operations (model must verify before claiming success)
- Smarter delays: 4s after enter/navigation, 2s base
- Update requirements.txt pins to >= for Python 3.13 compatibility
- Fix numpy 1.26.1 -> 1.26.2 (yanked)
@papadie23

Copy link
Copy Markdown
Author

Tested on:

  • Ubuntu 26.04 LTS (resolute), kernel 7.0.0-22
  • GNOME on Wayland (XDG_SESSION_TYPE=wayland)
  • Python 3.13 with conda
  • Screenshot working via flameshot (X11 tools broken on Wayland)

@papadie23

Copy link
Copy Markdown
Author

Testing performed

  • Tested operate -m deepseek-with-ocr on Ubuntu 26.04 Wayland
  • Successfully launched browsers, navigated to URLs via keyboard shortcuts
  • OCR text extraction works (Tesseract finds 150-250 screen text elements)
  • DeepSeek reasoning mode outputs visible chain-of-thought
  • Screenshot capture works across all fallback methods

Known limitations / Future improvements

  • OCR lacks visual context: Tesseract extracts raw text strings but cannot understand UI layout, window focus, or whether a page has fully loaded. The model sometimes types into the wrong window because it does not know which window is focused.
  • Vision model would be ideal: A future enhancement could pair this with a local vision model (e.g. Ollama + llava) to describe the screen visually, then feed that description to DeepSeek for decision-making. This would give the model awareness of window focus, loading states, and spatial layout.
  • EasyOCR click mapping: Click coordinate resolution via OCR text-to-coordinate mapping is still fragile — fuzzy matching helps but is not perfect.
  • Multi-monitor: The current flameshot capture works on primary display. Multi-monitor setups may need display selection.

I plan to keep iterating on these in follow-up PRs.

@papadie23

papadie23 commented Jun 28, 2026

Copy link
Copy Markdown
Author

Note: DeepSeek currently offers file/image upload in their web chat interface (chat.deepseek.com). While their API does not yet expose multimodal/vision endpoints, this suggests vision support may be added to the API in the future. When that happens, the deepseek-with-ocr mode could be upgraded to send screenshots directly as images.

@papadie23

Copy link
Copy Markdown
Author

The README could be updated to add DeepSeek under supported models, e.g.:

#### Try DeepSeek `-m deepseek-with-ocr`
operate -m deepseek-with-ocr

Similar to how Claude, Qwen, and LLaVA are listed. Happy to include that in this PR if maintainers want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant