Splitting Immich across two boxes

Immich is great. Running it on a low-power always-on box is great until you ask it to do face recognition on 85,000 photos and the box starts thinking about retirement. My setup splits Immich’s machine-learning workload onto an old GPU box that only wakes up when there’s work, while the always-on N100 keeps serving the UI, database, and uploads. An nginx shim sits between the two and lets the GPU box vanish whenever it wants. This is what the split looks like and why it doesn’t fall over when the GPU box is asleep.

The constraint that drove the design

Two machines I already had:

An N100 mini-PC running DietPi. 16GB RAM, no discrete GPU, fanless, always on, sips power. Runs my homelab — Authentik, Outline, n8n, the *arr stack, twenty other things. It’s the steady worker.
An older gaming PC with a GTX 1060 6GB, 32GB RAM, CUDA 12.2. Powerful for what I need. Loud, warm, uses real wattage. Not something I want running 24/7.

Immich, out of the box, expects to run as a single stack on one host. The N100 can do the server, database, web UI, and most jobs at acceptable speeds. What it absolutely cannot do in human time is face detection and CLIP-based smart search on a backlog of 85,000 photos — that’s a ~2-day CPU job, and the box would melt itself.

The GPU box can. In about three hours. The trick is letting it.
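
A back-of-the-envelope check on those two figures, as a sketch only: it assumes "~2 days" means 48 hours and that both numbers are wall-clock totals for the whole 85,000-photo backlog. They are my estimates, not benchmarks.

```python
# Rough throughput implied by the figures above (estimates, not benchmarks).
PHOTOS = 85_000
CPU_HOURS = 48   # "~2 days" on the N100, taken as 48 h (assumption)
GPU_HOURS = 3    # "about three hours" on the GPU box


def photos_per_second(photos: int, hours: float) -> float:
    """Average processing rate over the whole backlog."""
    return photos / (hours * 3600)


cpu_rate = photos_per_second(PHOTOS, CPU_HOURS)
gpu_rate = photos_per_second(PHOTOS, GPU_HOURS)
speedup = CPU_HOURS / GPU_HOURS

print(f"N100: {cpu_rate:.2f} photos/s")
print(f"GPU:  {gpu_rate:.2f} photos/s")
print(f"speedup: {speedup:.0f}x")
```

Under those assumptions the N100 manages roughly one photo every two seconds, and the GPU box is about sixteen times faster.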

The shape of the split

┌────────────────────────────────────────┐    ┌────────────────────────────────┐
│  N100 — always on                      │    │  GPU box — on-demand           │
│                                        │    │                                │
│  immich_server (:2283)                 │    │  immich_ml (:3003)             │
│  ├── immich_postgres                   │    │    └── CUDA inference          │
│  ├── immich_redis                      │    │    └── model cache             │
│  └── immich_ml_proxy (nginx) ──────────┼───▶│                                │
│                                        │    │  not managed by Komodo         │
│  Komodo-managed stack "immich"         │    │  manual `docker compose up -d` │
│                                        │    │  when ML work is queued        │
└────────────────────────────────────────┘    └────────────────────────────────┘

   Always-on parts:                              The GPU only wakes when:
   - serving the UI                              - I have a backlog
   - taking uploads from phones                  - I run `docker compose up -d`
   - storing the database                        - ML jobs are queued
   - running lightweight jobs                    - I want face recognition to
   - queueing ML work                              actually finish this week

The Immich server (on the N100) thinks it’s talking to a normal machine-learning endpoint. The endpoint URL is http://immich_ml_proxy:80. There is no machine learning happening at immich_ml_proxy. It’s a tiny nginx container on the N100 that forwards to the GPU box.

The nginx shim, in under twenty lines

upstream immich_ml {
    server gpu-box:3003 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    proxy_connect_timeout 3s;       # fail fast when the box is off
    proxy_read_timeout    300s;     # but be patient once connected
    proxy_send_timeout    300s;

    location / {
        proxy_pass http://immich_ml;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_next_upstream error timeout http_502 http_503;
    }
}

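Two knobs do the work, and a third is there for later:

proxy_connect_timeout 3s: if the GPU box is off, the connect attempt gives up after three seconds instead of nginx’s 60-second default. Immich gets a quick gateway error (nginx returns 504 when the connect times out, 502 when the connection is refused), marks the job for retry, and moves on. No hung worker.
proxy_read_timeout / proxy_send_timeout 300s: once we are connected, ML jobs can take a few minutes. These limit the gap between two successive reads or writes, not the whole request, so five minutes is generous. Don’t disconnect them mid-inference.
max_fails=3 fail_timeout=30s: in a group with several servers, three failed attempts within 30 seconds mark a server unavailable for the next 30 seconds. nginx documents these parameters as ignored when the group has only one server, as it does here, so today every request still gets its own three-second connect attempt. They start to matter only if a second ML host joins the upstream block. The same goes for proxy_next_upstream: with one server there is no next upstream to try.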

The result, behaviourally:

GPU box state	What Immich sees
On, healthy	ML jobs return in normal time
On, but OOM-restarting	Jobs queue, retry, eventually succeed
Off	Each ML job fails in ~3 seconds, queues for retry
Coming back online	Within a minute, queue starts draining

The seam is invisible to me as a user. The phone uploads a photo; if I notice smart search doesn’t have it yet, I SSH into the GPU box, run docker compose up -d, and within an hour everything is processed. If I forget, the backlog catches up on its own the next time I happen to start the GPU box for any reason.

Job concurrency, calibrated to the boxes that exist

The other half of the split is in Immich’s job-queue settings — adjusted so the N100 doesn’t melt and the GPU box gets the right work:

Job	Concurrency	Where it runs	Notes
Thumbnail Generation	2	N100	CPU-bound, kept low so the box stays responsive
Metadata Extraction	3	N100	Lightweight, can spike higher
Video Conversion	0	N100	Paused — will re-enable once QuickSync hwaccel is wired up
Smart Search	3	GPU box (via proxy)	Each job uses ~1.2GB VRAM
Face Detection	3	GPU box (via proxy)	Pair with Smart Search
Facial Recognition	1	N100	CPU clustering, not an ML job

Two settings earned their place painfully: one from the table above, and one on the GPU box itself.

Video Conversion = 0. First import, I left it on default. The N100 tried to transcode every video on the NAS. The CPU pinned at 100% for hours. The web UI became unreachable. The fix is hardware-accelerated transcoding via Intel QuickSync (/dev/dri passthrough), which I haven’t wired up yet, so for now: zero, paused, no transcoding.

ML workers = 1 on the GPU box. I tried 2 workers because “more is better.” With all three models loaded (smart search, face detection, OCR), 2 workers OOMed the 1060’s 6GB VRAM. The model TTL got tuned down to 300s so models unload when idle, but workers stay at 1.
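
The VRAM arithmetic behind these numbers can be sketched in a few lines. This uses only the figures quoted in this post (6GB card, ~1.2GB per Smart Search job, concurrency 3) and assumes VRAM use scales linearly with concurrency, which may not hold in practice. The function name is hypothetical, and the sketch does not account for the face detection or OCR models, whose sizes I have not measured.

```python
# VRAM budget sketch using the figures quoted in this post.
GPU_VRAM_GB = 6.0              # GTX 1060 6GB
SMART_SEARCH_JOB_GB = 1.2      # "~1.2GB VRAM" per Smart Search job
SMART_SEARCH_CONCURRENCY = 3   # from the concurrency table


def vram_headroom_gb(total_gb: float, per_job_gb: float, concurrency: int) -> float:
    """VRAM left after `concurrency` jobs, assuming linear scaling (assumption)."""
    return round(total_gb - per_job_gb * concurrency, 2)


headroom = vram_headroom_gb(GPU_VRAM_GB, SMART_SEARCH_JOB_GB, SMART_SEARCH_CONCURRENCY)
print(f"Smart Search at concurrency 3 leaves ~{headroom} GB for everything else")

# A hypothetical concurrency of 6 would not fit under the same assumption:
print(vram_headroom_gb(GPU_VRAM_GB, SMART_SEARCH_JOB_GB, 6))
```

The point is that the remaining headroom has to cover every other loaded model plus CUDA overhead, which is why there is so little room to experiment on this card.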

The thing that took me two weekends

The hardest part wasn’t the architecture. The hardest part was the model selection on a 6GB GPU.

I wanted multilingual CLIP, and the natural choice for smart search was XLM-RoBERTa-Large. Loaded alongside buffalo_l (face detection) and PP-OCRv5 (OCR), it OOMed the GPU. I tried four combinations:

attempt 1: XLM-RoBERTa-Large + buffalo_l + PP-OCRv5  → OOM
attempt 2: XLM-RoBERTa-Base  + buffalo_l + PP-OCRv5  → unstable
attempt 3: ViT-B-32__openai  + buffalo_l + PP-OCRv5  → works
attempt 4: same as 3, OCR disabled                    → comfortable headroom

I ended up shipping (3) — ViT-B-32__openai for smart search (English-only, fine for my photos), keeping OCR on. It’s not the model I wanted; it’s the model the hardware allowed. The lesson is the standard one: on constrained hardware, pick the workload before you pick the model.

What I’d build differently

A wake-on-LAN integration. Right now I start the GPU box by hand when I want ML to run. With WoL, the N100 could send the wake packet itself once the ML backlog is big enough to justify it, so Immich queues jobs and the box comes up without me. About an evening of work; haven’t done it.
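
For a sense of scale, the WoL half is small. A minimal sketch, assuming the GPU box has Wake-on-LAN enabled in its firmware and NIC and sits on the same broadcast domain as the N100. The magic-packet layout (six 0xFF bytes followed by the MAC repeated sixteen times) is the standard one. The backlog threshold, the function names, and the MAC address in the usage comment are hypothetical, and how the queue depth gets read out of Immich is left open.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is a common convention)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


def should_wake(queued_ml_jobs: int, threshold: int = 500) -> bool:
    """Hypothetical policy: wake only when the backlog justifies it."""
    return queued_ml_jobs >= threshold


# Usage (placeholder MAC, not a real address):
#   if should_wake(queued_ml_jobs=1200):
#       send_magic_packet("00:11:22:33:44:55")
```

Waking the machine is only half of it: the ML container still has to start once the box is up, which this sketch does not handle.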

Health-aware Immich UI. Immich doesn’t surface “the ML host is currently unreachable” in the UI; it just reports jobs as “queued.” I’d like a small badge somewhere that says “GPU host: offline; ML jobs paused.” Open issue, not on me to write.
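
Outside the UI, the check itself is easy to approximate. A minimal sketch, assuming a plain TCP connect to the ML port is a good enough signal and using the same three-second budget as the nginx shim. `ml_host_reachable` and `badge_text` are hypothetical names, not Immich APIs, and the "online" wording is made up for symmetry.

```python
import socket


def ml_host_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def badge_text(reachable: bool) -> str:
    """Text for the status badge described above."""
    if reachable:
        return "GPU host: online"
    return "GPU host: offline; ML jobs paused"


# Usage, with the host and port from the nginx upstream block:
#   print(badge_text(ml_host_reachable("gpu-box", 3003)))
```

A successful connect only proves the port is open, not that inference works, so this is a liveness hint rather than a health check.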

OIDC via Authentik. I have Authentik. Immich supports OIDC. I haven’t wired it up because the family logs in from their phones and I haven’t faced the dropdown UX yet. On the list.

Why this is worth doing

For 85,000 photos and a 6GB GPU that’s only on a few hours a day, the split:

Cuts the always-on power draw to N100 levels (~9W idle).
Makes the GPU box optional, not load-bearing.
Lets face recognition finish in hours, not days.
Survives the GPU box being off without operator intervention.

I’d run this same pattern for any home-scale ML workload where the work is bursty and expensive. The shape is general: an always-on cheap host serves the steady state, an on-demand expensive host does the bursts, and a shim between them hides the seam. Immich just happens to be the first workload I run it on.

Adjacent: Cloudflare Tunnel as a homelab front door — same homelab, exposed.