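UI Element Ops
Parses UI screenshots into structured element JSON (type, OCR text, bbox) and operates the desktop UI from the parsed elements. Use when the user asks to detect or locate UI elements.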
Skill Description
name: ui-element-ops
description: Parse UI screenshots into structured element JSON (type, OCR text, bbox) and operate desktop UI from parsed elements. Use when a user asks to detect/locate UI elements, return coordinates, find elements by text/type, wait for element appearance or disappearance, click/type/press keys/hotkeys, take screenshots, or calibrate coordinates for multi-display/DPI/window offsets.
UI Element Ops
Parse one or more screenshots into a machine-readable JSON schema with:
- type (normalized UI element type)
- bbox_px and bbox_norm
- text (OCR/caption content when available)
- clickable flag
- optional overlay image with labeled boxes
- desktop actions via scripts/operate_ui.py (click/type/key/hotkey/screenshot)
- element query and orchestration via scripts/operate_ui.py (find, wait)
- coordinate calibration profile for multi-display/DPI/window offset (calibrate)
Quick Start
- Prepare runtime once per machine:
skills/ui-element-ops/scripts/bootstrap_omniparser_env.sh "$PWD"
- Parse one screenshot:
skills/ui-element-ops/scripts/run_parse_ui.sh /abs/path/to/1.jpeg
- Read outputs:
<image>.elements.json
<image>.overlay.png
- One-step capture + parse with randomized names:
skills/ui-element-ops/scripts/capture_and_parse.sh
Workflow
- Confirm screenshot path and desired output path.
- Run scripts/bootstrap_omniparser_env.sh when .venv or OmniParser weights are missing.
- Run scripts/run_parse_ui.sh for standard parsing.
- Report absolute output paths and summary counts: total, clickable, by_type.
- Call out obvious quality risks for tiny text or dense icon layouts.
- Execute desktop actions when requested:
  - list elements: python3 skills/ui-element-ops/scripts/operate_ui.py list --elements <json>
  - find elements: python3 skills/ui-element-ops/scripts/operate_ui.py find --elements <json> --type button --text-contains login
  - wait for appear/disappear: python3 skills/ui-element-ops/scripts/operate_ui.py wait --elements <json> --state appear --text-contains continue
  - click by id: python3 skills/ui-element-ops/scripts/operate_ui.py click --elements <json> --id e_0001
  - screenshot: python3 skills/ui-element-ops/scripts/operate_ui.py screenshot (defaults to user tmp dir)
  - calibrate coordinates: python3 skills/ui-element-ops/scripts/operate_ui.py calibrate --parsed-size <w> <h> --actual-size <w> <h>
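The calibrate command takes a parsed size and an actual size. A minimal sketch of the arithmetic such a calibration could perform is shown here, assuming a simple per-axis scale plus an optional window offset. The Calibration class, make_calibration, and the window_offset parameter are hypothetical names for illustration and are not taken from operate_ui.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Calibration:
    """Hypothetical calibration profile; not the script's actual format."""
    scale_x: float
    scale_y: float
    offset_x: float = 0.0
    offset_y: float = 0.0

    def to_screen(self, x: float, y: float) -> tuple[int, int]:
        """Map a point in parsed-image pixels to screen coordinates."""
        return (round(x * self.scale_x + self.offset_x),
                round(y * self.scale_y + self.offset_y))


def make_calibration(parsed_size, actual_size, window_offset=(0, 0)):
    """Derive per-axis scale factors from two (width, height) pairs."""
    parsed_w, parsed_h = parsed_size
    actual_w, actual_h = actual_size
    if parsed_w <= 0 or parsed_h <= 0:
        raise ValueError("parsed size must be positive")
    return Calibration(actual_w / parsed_w, actual_h / parsed_h,
                       float(window_offset[0]), float(window_offset[1]))


# Example: a 2880x1800 screenshot of a display whose logical size is 1440x900.
cal = make_calibration((2880, 1800), (1440, 900))
print(cal.to_screen(1000, 500))  # (500, 250)
```

A mismatch like this is typical of high-DPI displays, where the screenshot has more pixels than the coordinate space used for mouse events.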
Tunables
- Edit type mapping keywords in references/type_rules.example.json.
- Use advanced parser args via scripts/parse_ui.py --help.
- Use --use-paddleocr only when paddleocr/paddlepaddle are installed.
Outputs
- Main JSON output:
  - schema_version, pipeline, image, counts, elements
  - each element has id, type, bbox_px, bbox_norm, text, clickable
- Overlay PNG output:
  - same screenshot with labeled detection boxes
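To make the schema concrete, the following sketch reads an elements document, computes the total/clickable/by_type summary, and filters elements by type and text. The sample data is invented. Two things are assumptions rather than documented behavior: that bbox_px is ordered [x1, y1, x2, y2], and that text matching is case-insensitive. The helper names are illustrative.

```python
import json
from collections import Counter

# Illustrative sample only; real files are produced by the parse scripts.
SAMPLE = json.loads("""
{
  "schema_version": "example",
  "elements": [
    {"id": "e_0001", "type": "button", "bbox_px": [100, 200, 300, 260],
     "text": "Login", "clickable": true},
    {"id": "e_0002", "type": "text", "bbox_px": [100, 100, 400, 140],
     "text": "Welcome back", "clickable": false}
  ]
}
""")


def summarize(doc):
    """Return the summary counts: total, clickable, by_type."""
    elements = doc["elements"]
    return {
        "total": len(elements),
        "clickable": sum(1 for e in elements if e.get("clickable")),
        "by_type": dict(Counter(e["type"] for e in elements)),
    }


def find(doc, type_=None, text_contains=None):
    """Filter elements by exact type and case-insensitive text substring."""
    needle = text_contains.lower() if text_contains else None
    hits = []
    for e in doc["elements"]:
        if type_ is not None and e["type"] != type_:
            continue
        if needle is not None and needle not in (e.get("text") or "").lower():
            continue
        hits.append(e)
    return hits


def center(bbox_px):
    """Center of a box, assuming the order [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = bbox_px
    return ((x1 + x2) // 2, (y1 + y2) // 2)


print(summarize(SAMPLE))
hit = find(SAMPLE, type_="button", text_contains="login")[0]
print(hit["id"], center(hit["bbox_px"]))  # e_0001 (200, 230)
```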
Failure Handling
- Missing dependencies or weights: run bootstrap script again.
- Permission/cache errors under $HOME: keep temporary caches under /tmp (handled by run script).
- CPU-only machine: expect slower inference.
- Performance note: parse/capture-and-parse commands are heavy; avoid very tight loops and reuse recent elements.json when possible.
- Headless environment limitation:
  - usable without GUI: parse/list/find/wait/calibrate on existing files.
  - requires GUI session: click/click-xy/type/key/hotkey/screenshot/screen-info.
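The reuse advice and the headless split above can be encoded as two small guards. This is a sketch only: is_fresh treats a file as reusable if it was modified within a chosen window, and needs_gui classifies an action using the two lists above. Both function names and the 5-second default are assumptions, not part of the skill's scripts.

```python
import os
import tempfile
import time

# Actions grouped as in the headless limitation list.
HEADLESS_OK = {"parse", "list", "find", "wait", "calibrate"}
NEEDS_GUI = {"click", "click-xy", "type", "key", "hotkey",
             "screenshot", "screen-info"}


def needs_gui(action):
    """True if the action requires a GUI session, False if it runs headless."""
    if action in NEEDS_GUI:
        return True
    if action in HEADLESS_OK:
        return False
    raise ValueError(f"unknown action: {action}")


def is_fresh(path, max_age_s=5.0, now=None):
    """True if path exists and was modified within max_age_s seconds."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False  # a missing or unreadable file is never fresh
    now = time.time() if now is None else now
    return (now - mtime) <= max_age_s


# Demo with a throwaway file standing in for a recent elements.json.
with tempfile.NamedTemporaryFile(suffix=".elements.json") as f:
    print(is_fresh(f.name, max_age_s=60))  # True
print(needs_gui("click"), needs_gui("find"))  # True False
```

A caller would re-run the heavy parse step only when is_fresh returns False, and would skip or report GUI-only actions when no display session is available.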
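How to use "UI Element Ops"?
- Open 小龙虾AI (Web or iOS App)
- Click the "Use Now" button above, or type a task description into the chat box
- 小龙虾AI automatically matches and invokes the "UI Element Ops" skill to complete the task
- Results appear immediately, and you can keep chatting to refine them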