# Vision pipeline
This page is for anyone who wants to understand why OculiX finds (or fails to find) a match. The TL;DR: under every `find()` is OpenCV's `matchTemplate` plus a feature-matching fallback, wrapped in a JNA layer (Apertix) that avoids the classic Java native-library conflicts.
## Stack overview

```text
Your script (Jython / Java)
        │
        ▼
sikuli.script.{Screen, Region, Pattern}
        │
        ▼
org.sikuli.script.Finder   ←—— similarity, target offset, region clipping
        │
        ▼
Apertix (OpenCV 4.10.0 via JNA)
        │
        ▼
Native OpenCV libs (bundled in JAR)
```

## Apertix — why we don't use vanilla OpenCV
OculiX depends on Apertix, a custom JNA-based build of OpenCV 4.10.0. It replaces the more common `org.openpnp:opencv` artifact for two reasons:
- **No `System.loadLibrary` conflict.** Apertix loads through JNA, so it doesn't fight other native libraries that use JNI on Windows (a classic problem when mixing OpenCV with VNC libraries or JFreeChart).
- **Pinned OpenCV 4.10.0**, compiled from source on Windows x86-64 with MSVC. Every OculiX release is built against the exact same OpenCV version, so behavior is reproducible across machines.
Maven coordinates:
```xml
<dependency>
  <groupId>io.github.julienmerconsulting.apertix</groupId>
  <artifactId>opencv</artifactId>
  <version>4.10.0-0</version>
</dependency>
```

Apertix repo: github.com/julienmerconsulting/Apertix.
## Template matching — the default
When you call `Region.find("button.png")`, OculiX runs OpenCV's `matchTemplate` with `TM_CCOEFF_NORMED`:
1. The captured `button.png` becomes the *template*.
2. The current screenshot of the region becomes the *scene*.
3. `matchTemplate` slides the template across every pixel of the scene and computes a normalized correlation score in [0.0, 1.0].
4. The pixel with the highest score is the candidate match.
5. If that score ≥ `similarity` (default 0.7), OculiX returns a `Match`; otherwise it raises `FindFailed`.
Template matching is pixel-precise but scale-sensitive: if the same button is rendered 10% larger on the target screen (high DPI, theme change), the match score drops. Two strategies:

- Adjust similarity: `Pattern("button.png").similar(0.6)` widens tolerance.
- Re-capture at the target scale: simple, robust, fast.
## Feature matching — for resilience
When template matching fails too often (rotation, scaling, light changes), OculiX falls back to feature matching:
```python
finder = Finder(image)
finder.findFeatures("logo.png")
if finder.hasNext():
    print finder.next()
```

Feature matching uses ORB descriptors (Oriented FAST and Rotated BRIEF). It's slower than template matching but robust to small rotations, partial occlusion, and moderate scaling. Use it when:
- The target moves within a window (drag-and-drop scenarios)
- The target rotates (compass widgets, rotating progress indicators)
- The same image is shown at multiple sizes (responsive UIs)
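ORB detection itself happens inside OpenCV, but the downstream matching step is easy to sketch: binary descriptors are compared by Hamming distance, and a candidate is kept only when its best match clearly beats the runner-up (Lowe's ratio test). A toy model with hand-made 8-bit descriptors; none of these names are OculiX API:

```python
def hamming(a, b):
    """Bit-level distance between two binary descriptors (ints)."""
    return bin(a ^ b).count("1")

def match(query, train, ratio=0.75):
    """For each query descriptor, keep the closest train descriptor
    only if it beats the second-closest by Lowe's ratio test."""
    kept = []
    for qi, q in enumerate(query):
        dists = sorted((hamming(q, t), ti) for ti, t in enumerate(train))
        best, second = dists[0], dists[1]
        if best[0] < ratio * second[0]:
            kept.append((qi, best[1]))   # (query index, train index)
    return kept

query = [0b10110010, 0b01001101]
train = [0b10110011, 0b11111111, 0b01001101]
print(match(query, train))  # → [(0, 0), (1, 2)]
```

The ratio test is what makes feature matching tolerant of clutter: an ambiguous descriptor (two nearly-equal candidates) is simply discarded rather than matched wrongly.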
## Region operations under the hood
`Region.right(N)` doesn't capture anything new — it just adjusts the search rectangle. The actual screen capture happens lazily at the next `find()`, `wait()`, or `click()` call.
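"Adjusts the search rectangle" is plain geometry. A toy model of the idea (OculiX's real `Region` class carries more state than this):

```python
from dataclasses import dataclass

@dataclass
class Region:
    x: int
    y: int
    w: int
    h: int

    def right(self, width):
        # New rectangle starting at our right edge, same y and height.
        # No pixels are touched — capture happens at the next find().
        return Region(self.x + self.w, self.y, width, self.h)

dialog = Region(100, 200, 400, 50)
print(dialog.right(300))  # → Region(x=500, y=200, w=300, h=50)
```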
This is why nested `find()` calls scoped to a small region are dramatically faster than a `find()` on the whole screen — you're reducing the number of pixels OpenCV has to scan.
```python
# Good — OpenCV scans 300 × 50 px
btn = dialog.right(300).find("save.png")

# Bad — OpenCV scans 1920 × 1080 px on every call
btn = Screen(0).find("save.png")
```

## Multi-monitor
Each monitor has its own `Screen(n)` instance. `Screen(0)` is the primary; `Screen.getNumberScreens()` tells you how many you've got.
```python
for i in range(Screen.getNumberScreens()):
    s = Screen(i)
    print "Screen %d: %d × %d at (%d, %d)" % (i, s.getW(), s.getH(), s.getX(), s.getY())
```

Captures and matches stay within a single Screen unless you explicitly merge regions across them.
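Geometrically, "merging regions" across screens means taking the bounding-rectangle union of the two rectangles. A sketch on plain `(x, y, w, h)` tuples (illustration only, not OculiX API):

```python
def union(a, b):
    """Bounding rectangle covering both regions, as (x, y, w, h)."""
    x = min(a[0], b[0])
    y = min(a[1], b[1])
    right = max(a[0] + a[2], b[0] + b[2])
    bottom = max(a[1] + a[3], b[1] + b[3])
    return (x, y, right - x, bottom - y)

# Two side-by-side 1920 × 1080 monitors merged into one search area:
print(union((0, 0, 1920, 1080), (1920, 0, 1920, 1080)))  # → (0, 0, 3840, 1080)
```

Note the cost: searching the merged rectangle means OpenCV scans both screens' pixels, so only merge when the target can genuinely appear on either monitor.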
## Highlight — your debugger
The single most useful tool when a script misbehaves:
```python
match = find("button.png")
match.highlight(2)           # red box for 2 s
match.highlight(2, "green")
```

You see exactly where OculiX believes the match is. 90% of "why didn't it click the right thing?" bugs become obvious within 5 seconds of running with `.highlight()` added.
## Slow Motion mode
Run → Run Slow Motion in the IDE adds a brief highlight before every action. Use it for:
- Demoing a script to a non-technical stakeholder
- Debugging an intermittent miss-click
- Recording a screencast of an automation walkthrough
## Settings that change the pipeline
```python
Settings.MinSimilarity = 0.7    # default similarity floor
Settings.AlwaysResize = 1.0     # pre-scale captures by N before matching
Settings.WaitScanRate = 3       # OpenCV scans per second during wait()
Settings.MoveMouseDelay = 0.3   # cursor glide time (visual feedback)
Settings.AutoWaitTimeout = 3.0  # implicit wait before every action
Settings.SaveLastImage = True   # dump the last failed match to ./lastImage.png
```

The `SaveLastImage` toggle is gold for debugging in CI: when a `find()` fails in a headless job, the last captured screen is written to disk for post-mortem.
## VNC, ADB, and the same pipeline
The vision pipeline is independent of the source. `VNCScreen`, `ADBScreen`, and the local `Screen` all expose the same find/click/type API — they just produce screenshots from different places. The OpenCV stack downstream doesn't care whether the image came from your monitor, an Android phone via ADB, or a remote machine via VNC.
```python
# Same script, three different sources
local_btn = Screen(0).find("save.png")
android_btn = ADBScreen.start(adb_path).find("save.png")
remote_btn = VNCScreen.start("192.168.1.10", 5900, "", 1920, 1080).find("save.png")
```

## Performance numbers
Rough order of magnitude on a 2024 mid-range laptop, 1920×1080 screen, 100×30 button:
| Operation | Wall-clock time |
|---|---|
| `Screen(0).capture()` | ~30 ms |
| `find()` on whole screen | ~50 ms |
| `find()` on 300×100 region | ~5 ms |
| `findFeatures()` on whole screen | ~200 ms |
| `Region.text()` Tesseract OCR | ~150 ms |
| `PaddleOCREngine.recognize()` | ~300 ms (CPU) |
OCR is roughly 10× slower than image matching. Use image matching whenever the target’s appearance is stable.
- Jython scripting — driving the pipeline from Python
- API reference — `Finder`, `Pattern`, `Match`, `Settings`
- Visual matching guide — the user-facing recipes