Shipping an embedded LLM with a desktop app

Why I bundle llama.cpp + Ministral-3B with a Rust app, how the lifecycle works, and what went wrong before it didn't.


ScreenSearch ships a small embedded LLM that runs entirely on the user’s machine. No API keys, no telemetry, no “we promise we don’t look at your data.” This is the lifecycle pattern I landed on: lazy start, idle TTL, port fallback, crash recovery, Vulkan acceleration. It’s about 600 lines of Rust around llama-server and a 1.6GB install footprint, and it’s the single feature that made the rest of the product feel real.

The constraint I started with

I wanted a desktop “memory” app where you could ask questions in natural language across a year of screenshots. The natural-language part is the only part that needed an LLM. Every existing answer involved one of three things I didn’t want:

  1. A hosted API. Wrong on principle for a tool that’s selling “your data never leaves the machine.”
  2. A local server the user runs themselves (Ollama, LM Studio). Fine for me; not fine for “double-click to install.”
  3. Burning the model into the binary. Possible. Also a 2GB binary. Also impossible to update without re-downloading the world.

The fourth option — bundle the inference engine and download the model on first run — is the one I went with. It moves the “is the LLM working” problem from the user’s lap into mine, which is correct: I am the one who chose to have an LLM.

What “embedded” means here

ScreenSearch’s screensearch-llm crate is a thin Rust wrapper around llama.cpp’s llama-server, running an instruction-tuned 3B model (Ministral-3B-Instruct, Q4_K_M quantization, GGUF format). The server speaks an OpenAI-compatible chat-completions API on localhost. The rest of the app talks to it as if it were OpenAI — same HTTP client, same JSON shapes, same streaming protocol.

This is the diagram on a clean day:

flowchart LR
    UI[React UI] -->|/chat| API[screen-api<br/>Axum]
    API -->|spawn / health| MGR[LlamaServer<br/>process manager]
    API -->|inference| ENG[LlmEngine<br/>HTTP client]
    MGR -.controls.-> PROC[llama-server<br/>localhost:31130]
    ENG -->|OpenAI-compatible| PROC
    PROC -->|GGUF| MODEL[(Ministral-3B<br/>Q4_K_M)]
    PROC -->|Vulkan| GPU[(GPU)]
    style PROC fill:#e6f4ea,stroke:#2c7a3f

Four moving parts: a config struct, a process manager, an HTTP client, and a download routine. None of them is exotic. The hard part is the lifecycle.

Lifecycle, the part that took me three weeks

The first version booted llama-server at app startup, kept it warm forever, and gave up about thirty seconds of laptop battery to the gods every minute. That was the simple version, and it was wrong.

The version that ships looks like this:

┌─────────────────────────────────────────────────────┐
│                STATES                               │
│                                                     │
│   ┌───────┐   first /chat   ┌─────────┐             │
│   │Stopped│ ──────────────▶ │Starting │             │
│   └───────┘                 └────┬────┘             │
│       ▲                          │ ready            │
│       │ idle TTL (5 min)         ▼                  │
│       │                     ┌─────────┐             │
│       │                     │ Running │             │
│       │                     └────┬────┘             │
│       │                          │ process exits    │
│       │                          ▼                  │
│       │                     ┌─────────┐             │
│       └──── give up ─────── │Crashed  │             │
│           (restarts ≥ 3)    └────┬────┘             │
│                                  │ restart          │
│                                  ▼                  │
│                             ┌─────────┐             │
│                             │Starting │             │
│                             └─────────┘             │
└─────────────────────────────────────────────────────┘

Five things that took me a deceptive amount of time:

Lazy start. The server isn’t spawned until the first /chat request lands. This is one line of code and one tricky property — the first request takes 2-4 seconds to return its first token, because that’s how long Ministral takes to load into memory. The UI shows a “warming up the model…” message during the cold start. Users tolerate this; they wouldn’t tolerate the daemon eating RAM at idle.

Idle TTL. After 5 minutes of no requests, the manager reaps the process. The model is unloaded; the GPU is released; the laptop stops being warm in the way that makes you check your bag for a fire. Five minutes is the magic number I landed on after watching real usage: the gap between “asked a question, scrolled the answer, thought about it” and “asked a follow-up” is almost always under five minutes; longer gaps almost always mean “got distracted, will come back later.”

Port fallback. The default port is 31130. If that’s taken, try 31131. Then 31132. Then surrender. This took embarrassingly long to add because I assumed nobody else would be using ports in the 31000 range. Corporate antivirus uses ports in the 31000 range. Of course it does.

Crash recovery, with limits. llama-server can exit if you ask it to do something dumb — OOM on a too-long prompt, request a context window that’s bigger than the model supports, that sort of thing. The manager will restart it up to 3 times in 60 seconds. The 4th crash is treated as “the model itself is broken on this machine” and surfaces an error to the UI. Without the cap, a sufficiently weird prompt could turn the manager into an infinite-restart machine, and your laptop fan into an angry insect.

Health monitoring. A background task polls /health every 30 seconds while the server is “Running.” The poll is cheaper than starting the model, and it catches the case where llama-server is technically still a process but has wedged in a way that requests never complete. Wedge detected → kill → restart.

The config struct, in full

pub struct LlmConfig {
    pub model_path: PathBuf,
    pub context_length: u32,          // 4096
    pub n_gpu_layers: i32,            // -1 = all, 0 = CPU only
    pub temperature: f32,             // 0.6
    pub max_tokens: u32,              // 512
    pub port_range: (u16, u16),       // (31130, 31132)
    pub idle_ttl_seconds: u64,        // 300
    pub max_restarts: u32,            // 3
    pub restart_window_seconds: u64,  // 60
    pub health_check_seconds: u64,    // 30
}

There are exactly twelve knobs. I started with thirty-something. Every removal made the code clearer and shipped a more reliable product. If you find yourself adding a config option for “what if a user wants to disable health checks?” — the answer is they don’t, and you don’t want to support what happens if they do.

GPU acceleration without the matrix

I use Vulkan, not CUDA, not ROCm, not Metal. The reason is one binary across NVIDIA, AMD, and Intel GPUs (and integrated graphics). The trade-off is throughput — a CUDA-backed llama.cpp will outperform Vulkan on NVIDIA hardware by maybe 30-40% — but throughput isn’t the bottleneck. The bottleneck is “does this work on the user’s actual machine, the first time, without them installing a 2GB toolkit.”

A small comparison from my own machines:

Hardwaretokens/secFirst-token latency
NVIDIA GTX 1060 (6GB)~30~1.8s
AMD RX 6700 XT (12GB)~42~1.4s
Intel Iris Xe (laptop iGPU)~8~3.2s
CPU only (12-core x86)~6~4.1s

8 tokens/sec on an integrated GPU is not exciting. It is also faster than I can read the answer, and the user got it without installing CUDA, so I’m calling it a win.

Downloads, with progress, with checksums

The model isn’t bundled with the installer. It’s downloaded on first launch from the configured source, with a SHA-256 checksum and a resumable download. About 1.4GB of model and 100MB of llama-server binary, on a checksum-verified, mid-stream-resumable HTTPS download with progress visible to the user.

Three pitfalls I hit:

What I’d tell someone trying this

The lifecycle is the product. Anyone can shell out to llama-server. The thing that makes a bundled LLM feel like a feature instead of a hack is that the user never has to think about it — it starts when they need it, stops when they don’t, recovers from crashes, finds a port, handles a slow download. Every minute spent on the lifecycle is worth more than a minute spent on the prompt engineering.

A 3B model is enough for a lot. I went through 1B, 3B, and 7B during development. 1B was too dumb for anything non-trivial. 7B was great but too heavy for the install footprint and idle GPU usage. 3B is the sweet spot for “summarize this,” “find me what I was reading about X,” and “rewrite this paragraph.” Don’t reach for 70B because Twitter says you should.

OpenAI-compatible is the smart bet. Building the rest of the app against an OpenAI-shaped API meant I could test against real OpenAI early on, swap to a local 3B halfway through, and not change a line of caller code. If the embedded model ever becomes the wrong choice for a particular feature, I can wire that one route to a hosted model without rearchitecting.

The whole thing — model, server, lifecycle, downloader, config — is maybe 600 lines of Rust on top of llama.cpp. The bulk of the file count is tests and error types. It is not, as engineering goes, hard. What it is, is worth doing: it turns the LLM from “a feature I’m renting from someone” into “a feature that ships in the binary.” For a tool whose entire pitch is local-first, that’s not a nice-to-have. It’s the whole thing.


Project: ScreenSearch. Related: Building ScreenSearch.