跳至主要内容
小龙虾小龙虾AI
🤖

UI 元素操作

将 UI 屏幕截图解析为结构化的元素 JSON(类型、OCR 文本、bbox),并从解析的元素操作桌面 UI。当用户要求检测/定位 UI 时使用。

下载143
星标0
版本1.0.2
开发工具
安全通过
⚙️脚本

技能说明


name: ui-element-ops description: Parse UI screenshots into structured element JSON (type, OCR text, bbox) and operate desktop UI from parsed elements. Use when a user asks to detect/locate UI elements, return coordinates, find elements by text/type, wait for element appearance or disappearance, click/type/press keys/hotkeys, take screenshots, or calibrate coordinates for multi-display/DPI/window offsets.

UI Element Ops

Parse one or more screenshots into a machine-readable JSON schema with:

  • type (normalized UI element type)
  • bbox_px and bbox_norm
  • text (OCR/caption content when available)
  • clickable flag
  • optional overlay image with labeled boxes
  • desktop actions via scripts/operate_ui.py (click/type/key/hotkey/screenshot)
  • element query and orchestration via scripts/operate_ui.py (find, wait)
  • coordinate calibration profile for multi-display/DPI/window offset (calibrate)

Quick Start

  1. Prepare runtime once per machine:
skills/ui-element-ops/scripts/bootstrap_omniparser_env.sh "$PWD"
  1. Parse one screenshot:
skills/ui-element-ops/scripts/run_parse_ui.sh /abs/path/to/1.jpeg
  1. Read outputs:
  • <image>.elements.json
  • <image>.overlay.png
  1. One-step capture + parse with randomized names:
skills/ui-element-ops/scripts/capture_and_parse.sh

Workflow

  1. Confirm screenshot path and desired output path.
  2. Run scripts/bootstrap_omniparser_env.sh when .venv or OmniParser weights are missing.
  3. Run scripts/run_parse_ui.sh for standard parsing.
  4. Report absolute output paths and summary counts: total, clickable, by_type.
  5. Call out obvious quality risks for tiny text or dense icon layouts.
  6. Execute desktop actions when requested:
    • list elements: python3 skills/ui-element-ops/scripts/operate_ui.py list --elements <json>
    • find elements: python3 skills/ui-element-ops/scripts/operate_ui.py find --elements <json> --type button --text-contains login
    • wait for appear/disappear: python3 skills/ui-element-ops/scripts/operate_ui.py wait --elements <json> --state appear --text-contains continue
    • click by id: python3 skills/ui-element-ops/scripts/operate_ui.py click --elements <json> --id e_0001
    • screenshot: python3 skills/ui-element-ops/scripts/operate_ui.py screenshot (defaults to user tmp dir)
    • calibrate coordinates: python3 skills/ui-element-ops/scripts/operate_ui.py calibrate --parsed-size <w> <h> --actual-size <w> <h>

Tunables

  • Edit type mapping keywords in references/type_rules.example.json.
  • Use advanced parser args via scripts/parse_ui.py --help.
  • Use --use-paddleocr only when paddleocr/paddlepaddle are installed.

Outputs

  • Main JSON output:
    • schema_version, pipeline, image, counts, elements
    • each element has id, type, bbox_px, bbox_norm, text, clickable
  • Overlay PNG output:
    • same screenshot with labeled detection boxes

Failure Handling

  • Missing dependencies or weights: run bootstrap script again.
  • Permission/cache errors under $HOME: keep temporary caches under /tmp (handled by run script).
  • CPU-only machine: expect slower inference.
  • Performance note: parse/capture-and-parse commands are heavy; avoid very tight loops and reuse recent elements.json when possible.
  • Headless environment limitation:
    • usable without GUI: parse/list/find/wait/calibrate on existing files.
    • requires GUI session: click/click-xy/type/key/hotkey/screenshot/screen-info.

如何使用「UI 元素操作」?

  1. 打开小龙虾AI(Web 或 iOS App)
  2. 点击上方「立即使用」按钮,或在对话框中输入任务描述
  3. 小龙虾AI 会自动匹配并调用「UI 元素操作」技能完成任务
  4. 结果即时呈现,支持继续对话优化

相关技能