Building ScreenSearch — Nicolas Estrem

ScreenSearch is a Windows desktop app I’ve been building since late 2025. It quietly captures the screen, runs OCR on the changed regions, stores text + thumbnails in a local SQLite database with FTS5, and lets me ask questions across it via an embedded 3B LLM. None of the data leaves the machine. This is a tour of the architecture, the decisions I made (and changed), and the parts I had to throw away.

What it is, in one paragraph

A daemon that runs in the system tray. Every three seconds it asks DXGI for whatever’s on screen, checks whether anything actually changed since the last frame, and if so, hands the frame to the Windows OCR API. The OCR text goes into a local SQLite database with full-text search. A small embedded LLM (Ministral-3B via llama.cpp) wakes up on demand when I want to query the index in natural language. There’s a tiny localhost HTTP server (Axum) that the UI talks to. That’s it.

The pitch is not “I built Rewind.” Rewind already exists. The pitch is “I wanted to know what a privacy-first version of Rewind would look like if it ran on Windows in Rust, and I wanted to know it badly enough to actually write the thing.”

The architecture, looking at me sideways

┌────────────────────────────────────────────────────────────────┐
│                        ScreenSearch                            │
│                                                                │
│  ┌──────────────┐    ┌────────────┐    ┌────────────────┐      │
│  │ Capture      │───▶│ OCR        │───▶│ DB Manager     │      │
│  │ (screen-     │    │ (screen-   │    │ (screen-db)    │      │
│  │  capture)    │    │  capture)  │    │ SQLite + FTS5  │      │
│  └──────┬───────┘    └──────┬─────┘    └───────┬────────┘      │
│         │ DXGI               │ WinRT OCR        │              │
│         ▼                    ▼                  │              │
│   ┌─────────────┐      ┌──────────────┐         │              │
│   │ frame diff  │      │ language     │         │              │
│   │ (skip dups) │      │ detection    │         │              │
│   └─────────────┘      └──────────────┘         │              │
│                                                 ▼              │
│                                       ┌─────────────────┐      │
│                                       │ API server      │      │
│                                       │ (screen-api)    │      │
│                                       │ Axum localhost  │      │
│                                       └────────┬────────┘      │
│                                                │               │
│                                                ▼               │
│                                       ┌─────────────────┐      │
│                                       │ Embedded LLM    │      │
│                                       │ (screensearch-  │      │
│                                       │  llm)           │      │
│                                       │ llama-server,   │      │
│                                       │ Ministral-3B    │      │
│                                       └─────────────────┘      │
│                                                                │
└────────────────────────────────────────────────────────────────┘
                                ▲
                                │ localhost:3131
                                │
                       ┌────────┴────────┐
                       │ React UI        │
                       │ (timeline,      │
                       │  search, chat)  │
                       └─────────────────┘

It’s seven crates and one UI. The bulk of the code is in the Rust workspace; the UI is a small React app that talks to the API on localhost:3131. Everything outside that diagram — analytics, telemetry, cloud anything — is deliberately absent.

Crate	Job
`screen-capture`	DXGI capture loop, frame differ, OCR via WinRT
`screen-db`	SQLite schema, FTS5, write-ahead log, queries via sqlx
`screen-api`	Axum HTTP server on `localhost:3131`
`screen-automation`	UI Automation hooks for window-context tagging
`screensearch-embeddings`	ONNX-runtime embedding generation
`screensearch-llm`	`llama-server` lifecycle + OpenAI-compatible client
`src/main.rs`	Glue: config loading, tray, signal handling, supervision

Capture: cheaper than I expected, then more expensive than I expected

The capture loop is where I spent the most time and made the most wrong turns.

The first version used GDI (BitBlt) — easy to write, works on every Windows since the Pleistocene, and capable of capturing about 12 frames per second on my development laptop before the CPU started complaining. That was more than the 1 frame every 3 seconds I actually needed, so I shipped it.

Then a tester ran it on a four-monitor workstation and the daemon ate 18% CPU at idle. GDI’s per-monitor cost is not zero, and four times not-zero is something. I rewrote the capture using DXGI Desktop Duplication, which lets the OS hand you the dirty regions of the screen since the last capture. At idle, on most frames, the answer is “nothing changed; here’s a zero-byte response.” CPU at idle on the four-monitor box dropped to 0.9%. The lesson: pull from the OS’s change-tracking instead of doing your own.

The frame differ is the second cheap-then-expensive piece. The naive version compared every pixel; the tuned version compares a downsampled luma channel and counts only the pixels whose delta exceeds a per-pixel threshold. A separate config knob (diff_threshold = 0.006) sets how many pixels have to count: 0.6% of them must change before we save the frame. Below that, it’s almost certainly a clock blinking or a cursor moving.

pub struct CaptureConfig {
    pub interval_ms: u64,           // 3000
    pub monitor_indices: Vec<usize>, // empty = all
    pub enable_frame_diff: bool,    // true
    pub diff_threshold: f32,        // 0.006
    pub max_frames_buffer: usize,   // 30
    pub include_cursor: bool,       // true
    pub draw_border: bool,          // false
}
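
For illustration, the downsampled-luma comparison might look like the sketch below. This is not the code from screen-capture: `luma_from_bgra`, `changed_fraction`, `should_save`, and the per-pixel `pixel_delta` parameter are hypothetical names, and the BGRA byte order and integer luma weights are assumptions.

```rust
/// Approximate 8-bit luma for one pixel, assuming BGRA byte order.
/// The integer weights (77, 150, 29) sum to 256 and approximate BT.601.
pub fn luma_from_bgra(px: [u8; 4]) -> u8 {
    let [b, g, r, _a] = px;
    ((77 * r as u32 + 150 * g as u32 + 29 * b as u32) >> 8) as u8
}

/// Fraction of luma samples whose absolute delta exceeds `pixel_delta`.
/// Both slices are assumed to be already-downsampled luma planes.
pub fn changed_fraction(prev: &[u8], curr: &[u8], pixel_delta: u8) -> f32 {
    if prev.is_empty() || prev.len() != curr.len() {
        // No previous frame, or the resolution changed: treat as fully changed.
        return 1.0;
    }
    let changed = prev
        .iter()
        .zip(curr)
        .filter(|(a, b)| a.abs_diff(**b) > pixel_delta)
        .count();
    changed as f32 / prev.len() as f32
}

/// True when enough pixels changed to be worth saving the frame.
/// `diff_threshold` plays the role of the config knob (0.006 = 0.6%).
pub fn should_save(prev: &[u8], curr: &[u8], pixel_delta: u8, diff_threshold: f32) -> bool {
    changed_fraction(prev, curr, pixel_delta) >= diff_threshold
}
```

With a 1000-sample plane and `diff_threshold = 0.006`, five changed samples (0.5%) are skipped and ten (1%) are saved. Small deltas below `pixel_delta` never count at all, which is what filters out compression shimmer and anti-aliasing noise.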

I had to add max_frames_buffer after a build pegged at 4GB of resident memory: the OCR pipeline couldn’t keep up, and the capture loop was helpfully queueing every unprocessed frame. Bounded queues are not optional in a real-time pipeline. I now know this in a way that is hard to unlearn.
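
A minimal sketch of a bounded hand-off between capture and OCR, using only the standard library. The `Frame` and `Offer` types and the `frame_queue` and `offer` functions are hypothetical, and dropping the newest frame when the queue is full is an assumed policy; the post does not say which frame ScreenSearch discards.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical frame payload; the real type is not shown in the post.
pub struct Frame {
    pub seq: u64,
    pub pixels: Vec<u8>,
}

/// What happened to a frame offered to the OCR queue.
#[derive(Debug, PartialEq, Eq)]
pub enum Offer {
    Queued,
    Dropped,      // queue full: OCR is behind, so this frame is skipped
    ConsumerGone, // the OCR side hung up
}

/// Creates a queue holding at most `max_frames_buffer` frames.
/// A capacity of 0 would make this a rendezvous channel, so use at least 1.
pub fn frame_queue(max_frames_buffer: usize) -> (SyncSender<Frame>, Receiver<Frame>) {
    sync_channel(max_frames_buffer)
}

/// Non-blocking send: the capture loop never waits on OCR and never
/// accumulates more than the channel's capacity in memory.
pub fn offer(tx: &SyncSender<Frame>, frame: Frame) -> Offer {
    match tx.try_send(frame) {
        Ok(()) => Offer::Queued,
        Err(TrySendError::Full(_)) => Offer::Dropped,
        Err(TrySendError::Disconnected(_)) => Offer::ConsumerGone,
    }
}
```

The point of `try_send` over `send` is that backpressure turns into dropped frames instead of unbounded memory growth or a stalled capture loop.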

OCR: the part I almost wrote myself

The first plan was to ship Tesseract. The second plan was to ship a Rust OCR crate I’d seen on lobste.rs. The third plan, which I should have started with, was to use the OCR API that already ships with Windows.

Windows.Media.Ocr is available via WinRT, exposed cleanly to Rust through the windows crate. It’s not state-of-the-art, but it handles eleven languages out of the box, runs entirely locally, and costs me nothing to package. The accuracy on screenshots of actual application UI (which is what 95% of the corpus is) was indistinguishable from Tesseract in a head-to-head test. Tesseract was about 3x slower and added 200MB to the installer.

What I learned: when you’re building a desktop app on Windows, the question to ask first is “does Windows already do this?” The answer is yes more often than you’d think.

Storage: FTS5 is a small miracle

SQLite with the FTS5 extension is the entire “search” half of “screen search.” A virtual table mirrors the OCR text, the writer commits every batch in a single transaction, and queries come back in tens of milliseconds against a corpus of millions of rows.

A few things I learned the awkward way:

WAL mode is not optional. Without it, the OCR writer and the API reader fight each other and one of them loses. With it, they don’t.
Synchronous OCR-then-write is fine for the first 100k rows; after that, no. I moved the FTS insertion behind a small batched worker. The writer side of SQLite hates fast small commits and likes slower bigger ones.
FTS5 contentless tables are tempting and a trap. They save disk; they also remove the ability to recover the original text from FTS rows. I tried it, hated it, reverted.
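
The “slower bigger commits” idea can be sketched as a batcher that releases rows when either a size or an age limit is hit. `Batcher` and its limits are hypothetical and standard-library only; the assumption is that the caller writes each returned batch inside a single SQLite transaction.

```rust
use std::time::{Duration, Instant};

/// Collects rows and hands them back in batches for one-transaction writes.
pub struct Batcher<T> {
    pending: Vec<T>,
    max_rows: usize,
    max_age: Duration,
    oldest: Option<Instant>, // arrival time of the first pending row
}

impl<T> Batcher<T> {
    pub fn new(max_rows: usize, max_age: Duration) -> Self {
        Self {
            pending: Vec::new(),
            max_rows: max_rows.max(1),
            max_age,
            oldest: None,
        }
    }

    /// Adds a row; returns a batch if the size or age limit is now reached.
    pub fn push(&mut self, row: T, now: Instant) -> Option<Vec<T>> {
        if self.pending.is_empty() {
            self.oldest = Some(now);
        }
        self.pending.push(row);
        self.take_if_due(now)
    }

    /// Call on a timer so a slow trickle of rows still gets committed.
    pub fn take_if_due(&mut self, now: Instant) -> Option<Vec<T>> {
        let aged = self
            .oldest
            .map_or(false, |t| now.duration_since(t) >= self.max_age);
        if self.pending.len() >= self.max_rows || aged {
            self.flush()
        } else {
            None
        }
    }

    /// Unconditionally drains whatever is pending (e.g. on shutdown).
    pub fn flush(&mut self) -> Option<Vec<T>> {
        self.oldest = None;
        if self.pending.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.pending))
        }
    }
}
```

Passing `now` in explicitly keeps the logic deterministic and easy to test; the worker loop would supply `Instant::now()`.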

The schema, lightly elided:

CREATE TABLE captures (
  id            INTEGER PRIMARY KEY,
  captured_at   INTEGER NOT NULL,
  monitor_index INTEGER NOT NULL,
  thumb_path    TEXT,
  window_title  TEXT,
  process_name  TEXT,
  text          TEXT     -- OCR output; captures_fts (external content) reads this column by name
);

CREATE VIRTUAL TABLE captures_fts USING fts5(
  text,
  window_title,
  process_name,
  content='captures',
  content_rowid='id'
);

-- triggers to keep the FTS table in sync, omitted for sanity

Embedded LLM: the new fun part

The most recent addition is screensearch-llm, which is a thin wrapper around llama.cpp’s llama-server running an instruction-tuned 3B model (Ministral-3B-Instruct, Q4_K_M, GGUF). It speaks the OpenAI-compatible chat-completions API, so the rest of the system pretends it’s talking to a tiny private GPT.

The trick is lifecycle. A 3B model uses about 2.2GB of RAM warm; running it 24/7 is not what I want on a laptop. So:

Lazy load. The server isn’t started until the first chat request lands.
Idle TTL. After 5 minutes of no requests, the process is reaped.
Crash recovery. Up to 3 restarts on llama-server exit. If it crashes a fourth time, give up loudly.
Port fallback. Tries 31130 → 31131 → 31132 because corporate antivirus eats ports for reasons I do not understand.
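
A hedged sketch of the port-fallback step. `first_usable` and `probe_loopback` are hypothetical helpers, not functions from screensearch-llm, and probing by binding a loopback listener is an assumption about how availability could be checked. A probe like this is inherently racy, since another process can take the port between the probe and the server start.

```rust
use std::io;
use std::net::{Ipv4Addr, TcpListener};

/// Tries each candidate port in order and returns the first one the probe
/// accepts, along with whatever the probe produced. If every candidate
/// fails, returns the per-port errors so the caller can report them.
pub fn first_usable<T, E>(
    candidates: &[u16],
    mut probe: impl FnMut(u16) -> Result<T, E>,
) -> Result<(u16, T), Vec<(u16, E)>> {
    let mut failures = Vec::new();
    for &port in candidates {
        match probe(port) {
            Ok(value) => return Ok((port, value)),
            Err(err) => failures.push((port, err)),
        }
    }
    Err(failures)
}

/// One possible probe: try to bind the port on 127.0.0.1.
/// The listener must be dropped before the real server can bind the port.
pub fn probe_loopback(port: u16) -> io::Result<TcpListener> {
    TcpListener::bind((Ipv4Addr::LOCALHOST, port))
}
```

Usage would be along the lines of `first_usable(&[31130, 31131, 31132], probe_loopback)`. Keeping the probe as a parameter means the same loop also works if the check is "start the server on this port and see whether it comes up", which avoids the race.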

stateDiagram-v2
    [*] --> Stopped
    Stopped --> Starting: first /chat request
    Starting --> Running: model loaded
    Starting --> Crashed: load fails
    Running --> Stopped: idle TTL expires
    Running --> Crashed: process exits
    Crashed --> Starting: restarts < 3
    Crashed --> [*]: restarts ≥ 3
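
The diagram translates fairly directly into a small state machine. The sketch below is an illustration, not the code in screensearch-llm: the `State`, `Event`, and `Supervisor` names are hypothetical, the transient Crashed state is folded into the crash transition, and the restart counter is never reset because the post does not say when (or whether) it is.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Stopped,
    Starting,
    Running,
    GaveUp, // terminal: restart budget exhausted
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    ChatRequest,
    ModelLoaded,
    LoadFailed,
    IdleTtlExpired,
    ProcessExited,
}

pub const MAX_RESTARTS: u32 = 3;

pub struct Supervisor {
    pub state: State,
    pub restarts: u32,
}

impl Supervisor {
    pub fn new() -> Self {
        Self { state: State::Stopped, restarts: 0 }
    }

    /// Crashed --> Starting while restarts < 3, otherwise give up.
    fn after_crash(&mut self) -> State {
        if self.restarts < MAX_RESTARTS {
            self.restarts += 1;
            State::Starting
        } else {
            State::GaveUp
        }
    }

    /// Applies one event and returns the new state.
    /// Events that make no sense in the current state are ignored.
    pub fn on(&mut self, event: Event) -> State {
        use Event::*;
        use State::*;
        let next = match (self.state, event) {
            (Stopped, ChatRequest) => Starting,
            (Starting, ModelLoaded) => Running,
            (Running, IdleTtlExpired) => Stopped,
            (Starting, LoadFailed) | (Running, ProcessExited) => self.after_crash(),
            (current, _) => current,
        };
        self.state = next;
        next
    }
}
```

Three crashes each lead back to Starting; the fourth lands in GaveUp, which matches "up to 3 restarts, then give up loudly." The side effects (spawning the process, starting the idle timer) would hang off the transitions and are left out here.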

It uses Vulkan for GPU acceleration, which means it works on Intel, AMD, and NVIDIA without three separate build paths. On my GTX 1060 it’s about 30 tokens/sec; on integrated Intel Iris it’s around 8 tokens/sec — slower, but still fast enough that I haven’t reached for a remote model since I added this.

The whole thing weighs roughly 1.5GB on disk (model + binaries) and feels worth every megabyte.

What I’d build differently if I started over

Start with DXGI. Don’t ship GDI capture even as a “placeholder.” Placeholder code is permanent.

Pick the OS’s OCR first. I’d have saved a week.

Build the API surface before the schema. I built schema-first because it felt rigorous, then spent three days reshaping it once I knew what the UI actually wanted. Going the other way is cheaper.

Don’t add an embedded LLM until the rest works. I added it earlier than I should have, when the OCR pipeline was still rough, and ended up debugging both at once. The lesson is the standard one — fix one layer before adding the next — and I knew it, and I did it anyway.

What this is, in the end

ScreenSearch is the kind of project you build because you want it to exist, then keep building because you want to know whether it can exist on your hardware, on a free stack, without sending anything to anyone. The answer turned out to be yes. It’s not as polished as Rewind, and it never will be, because Rewind has a team and I have one keyboard. But it is mine, and it sits on the laptop and does its job, and every time I find a screenshot from three weeks ago by typing the partial name of a function into a localhost search bar, the project pays for itself again.

If you’re reading this and tempted to build your own version: the moat is smaller than you think, the OS gives you more than you’d expect, and a 3B model in a 4GB envelope can do most of what you actually want a “memory” tool to do.

Code: github.com/nicolasestrem/screensearch. Further reading: “Shipping an embedded LLM with a desktop app” and “Two bottlenecks that killed my capture pipeline.”