Immich is great. Running it on a low-power always-on box is great until you ask it to do face recognition on 85,000 photos and the box starts thinking about retirement. My setup splits Immich’s machine-learning workload onto an old GPU box that only wakes up when there’s work, while the always-on N100 keeps serving the UI, database, and uploads. An nginx shim sits between the two and lets the GPU box vanish whenever it wants. This is what the split looks like and why it doesn’t fall over when the GPU box is asleep.
The constraint that drove the design
Two machines I already had:
- An N100 mini-PC running DietPi. 16GB RAM, no discrete GPU, fanless, always on, sips power. Runs my homelab — Authentik, Outline, n8n, the *arr stack, twenty other things. It’s the steady worker.
- An older gaming PC with a GTX 1060 6GB, 32GB RAM, CUDA 12.2. Powerful for what I need. Loud, warm, uses real wattage. Not something I want running 24/7.
Immich, out of the box, expects to run as a single stack on one host. The N100 can do the server, database, web UI, and most jobs at acceptable speeds. What it absolutely cannot do in human time is face detection and CLIP-based smart search on a backlog of 85,000 photos — that’s a ~2-day CPU job, and the box would melt itself.
The GPU box can. In about three hours. The trick is letting it.
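To put the two estimates side by side, here is a back-of-the-envelope sketch using only the rough figures quoted above. Reading "~2-day" as 48 hours is an assumption, and both totals are approximations, so treat the per-photo numbers as orders of magnitude rather than measurements.

```python
# Back-of-the-envelope throughput from the rough figures in the text.
PHOTOS = 85_000
CPU_HOURS = 48  # "~2-day CPU job", read as 48 hours (assumption)
GPU_HOURS = 3   # "about three hours"


def seconds_per_photo(total_hours: float, photos: int = PHOTOS) -> float:
    """Average wall-clock seconds spent per photo."""
    return total_hours * 3600 / photos


cpu = seconds_per_photo(CPU_HOURS)
gpu = seconds_per_photo(GPU_HOURS)
speedup = CPU_HOURS / GPU_HOURS

print(f"CPU: {cpu:.2f} s/photo")   # CPU: 2.03 s/photo
print(f"GPU: {gpu:.3f} s/photo")   # GPU: 0.127 s/photo
print(f"Speedup: {speedup:.0f}x")  # Speedup: 16x
```

At roughly two seconds per photo the N100 would get there eventually; the problem is that it would be doing nothing else for two days.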
The shape of the split
┌────────────────────────────────────────┐    ┌────────────────────────────────┐
│  N100 - always on                      │    │  GPU box - on-demand           │
│                                        │    │                                │
│  immich_server (:2283)                 │    │  immich_ml (:3003)             │
│  ├── immich_postgres                   │    │  └── CUDA inference            │
│  ├── immich_redis                      │    │  └── model cache               │
│  └── immich_ml_proxy (nginx) ──────────┼───▶│                                │
│                                        │    │  not managed by Komodo         │
│  Komodo-managed stack "immich"         │    │  manual `docker compose up -d` │
│                                        │    │  when ML work is queued        │
└────────────────────────────────────────┘    └────────────────────────────────┘
Always-on parts:
- serving the UI
- taking uploads from phones
- storing the database
- running lightweight jobs
- queueing ML work
The GPU only wakes when:
- I have a backlog
- I run `docker compose up -d`
- ML jobs are queued
- I want face recognition to actually finish this week
The Immich server (on the N100) thinks it’s talking to a normal machine-learning endpoint. The endpoint URL is http://immich_ml_proxy:80. There is no machine learning happening at immich_ml_proxy. It’s a tiny nginx container on the N100 that forwards to the GPU box.
The nginx shim, in twenty lines
upstream immich_ml {
    server gpu-box:3003 max_fails=3 fail_timeout=30s;
}
server {
    listen 80;
    proxy_connect_timeout 3s;    # fail fast when the box is off
    proxy_read_timeout 300s;     # but be patient once connected
    proxy_send_timeout 300s;
    location / {
        proxy_pass http://immich_ml;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_next_upstream error timeout http_502 http_503;
    }
}
Three knobs do the work:
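- proxy_connect_timeout 3s: if the GPU box is off, the connect attempt fails within three seconds. Immich gets a quick gateway error (502 or 504, depending on how the connect fails), marks the job for retry, and moves on. No 90-second TCP wait, no hung worker.
- proxy_read_timeout / proxy_send_timeout 300s: once we are connected, ML jobs can take a few minutes. Don’t disconnect them mid-inference.
- max_fails=3 fail_timeout=30s: after three failures within 30 seconds, nginx marks an upstream server unavailable for 30 seconds. One caveat: the nginx documentation says these parameters are ignored when the upstream group contains only one server, as it does here, so in this config the fail-fast behaviour really comes from the connect timeout. Immich’s queue keeps retrying every job either way.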
The result, behaviourally:
| GPU box state | What Immich sees |
|---|---|
| On, healthy | ML jobs return in normal time |
| On, but OOM-restarting | Jobs queue, retry, eventually succeed |
| Off | Each ML job fails in ~3 seconds, queues for retry |
| Coming back online | Within a minute, queue starts draining |
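The "Off" and "Coming back online" rows can be checked from the N100 without going through Immich at all. The sketch below is a plain TCP probe with the same three-second budget as the proxy's connect timeout. The function name is made up for illustration, and it only tests that the port accepts connections, not that the ML service behind it is healthy.

```python
import socket


def ml_host_reachable(host: str, port: int = 3003, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hostnames.
        return False
```

Calling `ml_host_reachable("gpu-box")` uses the hostname from the nginx upstream. A `False` here corresponds to the state where each ML job fails fast and waits in the queue for a retry.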
The seam is invisible to me as a user. The phone uploads a photo; if I notice that smart search doesn’t have it yet, I SSH into the GPU box, run `docker compose up -d`, and within an hour everything is processed. If I forget, the queue catches up on its own the next time I start the GPU box for any reason.
Job concurrency, calibrated to the boxes that exist
The other half of the split is in Immich’s job-queue settings — adjusted so the N100 doesn’t melt and the GPU box gets the right work:
| Job | Concurrency | Where it runs | Notes |
|---|---|---|---|
| Thumbnail Generation | 2 | N100 | CPU-bound, kept low so the box stays responsive |
| Metadata Extraction | 3 | N100 | Lightweight, can spike higher |
| Video Conversion | 0 | N100 | Paused — will re-enable once QuickSync hwaccel is wired up |
| Smart Search | 3 | GPU box (via proxy) | Each job uses ~1.2GB VRAM |
| Face Detection | 3 | GPU box (via proxy) | Pair with Smart Search |
| Facial Recognition | 1 | N100 | CPU clustering, not an ML job |
Two settings earned their place painfully: one from the table, and one that lives on the GPU box itself.
Video Conversion = 0. First import, I left it on default. The N100 tried to transcode every video on the NAS. The CPU pinned at 100% for hours. The web UI became unreachable. The fix is hardware-accelerated transcoding via Intel QuickSync (/dev/dri passthrough), which I haven’t wired up yet, so for now: zero, paused, no transcoding.
ML workers = 1 on the GPU box. I tried 2 workers because “more is better.” With all three models loaded (smart search, face detection, OCR), 2 workers OOMed the 1060’s 6GB VRAM. The model TTL got tuned down to 300s so models unload when idle, but workers stay at 1.
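The VRAM note on the Smart Search row gives a rough feel for why the 6GB card is tight. The sketch below takes the table's per-job figure at face value and budgets for Smart Search alone. Face detection and OCR have to fit in whatever is left, and their footprints are not given here, so this is a lower bound on pressure rather than a full accounting.

```python
# Rough VRAM budget for Smart Search alone, using the figures quoted above.
GPU_VRAM_GB = 6.0            # GTX 1060 6GB
SMART_SEARCH_JOB_GB = 1.2    # "~1.2GB VRAM" per job, from the table's Notes column
SMART_SEARCH_CONCURRENCY = 3


def vram_headroom_gb(concurrency: int, per_job_gb: float, total_gb: float) -> float:
    """VRAM left after `concurrency` jobs run at once, rounded to 0.1 GB."""
    return round(total_gb - concurrency * per_job_gb, 1)


headroom = vram_headroom_gb(SMART_SEARCH_CONCURRENCY, SMART_SEARCH_JOB_GB, GPU_VRAM_GB)
print(f"Smart Search at concurrency 3 leaves ~{headroom} GB")  # ~2.4 GB
```

With well under half the card left for everything else, it is not surprising that a second worker, each loading its own copy of the models, tipped it over.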
The thing that took me two weekends
The hardest part wasn’t the architecture. The hardest part was the model selection on a 6GB GPU.
I wanted multilingual CLIP. The natural choice was XLM-RoBERTa-Large for smart search. Loaded alongside buffalo_l (face detection) and PP-OCRv5 (OCR), it OOMed the GPU. I tried several combinations:
attempt 1: XLM-RoBERTa-Large + buffalo_l + PP-OCRv5 → OOM
attempt 2: XLM-RoBERTa-Base + buffalo_l + PP-OCRv5 → unstable
attempt 3: ViT-B-32__openai + buffalo_l + PP-OCRv5 → works
attempt 4: same as 3, OCR disabled → comfortable headroom
I ended up shipping (3) — ViT-B-32__openai for smart search (English-only, fine for my photos), keeping OCR on. It’s not the model I wanted; it’s the model the hardware allowed. The lesson is the standard one: on constrained hardware, pick the workload before you pick the model.
What I’d build differently
A wake-on-LAN integration. Right now I manually start the GPU box when I want ML to run. A Wake-on-LAN sender on the N100 would let Immich queue jobs as usual and wake the GPU box once there’s enough backlog to justify it. About an evening of work; haven’t done it.
Health-aware Immich UI. Immich doesn’t surface “the ML host is currently unreachable” in the UI; it just reports jobs as “queued.” I’d like a small badge somewhere that says “GPU host: offline; ML jobs paused.” Open issue, not on me to write.
OIDC via Authentik. I have Authentik. Immich supports OIDC. I haven’t wired it up because the family logs in from their phones and I haven’t faced the dropdown UX yet. On the list.
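For the wake-on-LAN idea above, the sending side is small. This is a minimal sketch of the standard magic packet (six 0xFF bytes followed by the target MAC repeated sixteen times) broadcast over UDP. The function names and the MAC address in the usage note are placeholders, UDP port 9 is a common convention rather than a requirement, and the GPU box would also need WoL enabled in its firmware and NIC settings. Deciding when the backlog is big enough is left out, since that depends on how the queue is inspected.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP on the local network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Something like `send_wol("00:11:22:33:44:55")` from a cron job or an n8n step on the N100 would be the whole integration on the sending side; the GPU box would still need to start its compose stack on boot.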
Why this is worth doing
For 85,000 photos and a 6GB GPU that’s only on a few hours a day, the split:
- Cuts the always-on power draw to N100 levels (~9W idle).
- Makes the GPU box optional, not load-bearing.
- Lets face recognition finish in hours, not days.
- Survives the GPU box being off without operator intervention.
I’d run this same pattern for any home-scale ML workload where the work is bursty and expensive. The shape is general: an always-on cheap host serves the steady state, an on-demand expensive host does the bursts, and a shim between them hides the seam. Immich just happens to be the first workload I run it for.
Adjacent: Cloudflare Tunnel as a homelab front door — same homelab, exposed.