Splitting Immich across two boxes

Why my always-on N100 runs the database and an old GPU box only wakes up for ML — and the nginx availability-shim that makes the seam invisible.


Immich is great. Running it on a low-power always-on box is great until you ask it to do face recognition on 85,000 photos and the box starts thinking about retirement. My setup splits Immich’s machine-learning workload onto an old GPU box that only wakes up when there’s work, while the always-on N100 keeps serving the UI, database, and uploads. An nginx shim sits between the two and lets the GPU box vanish whenever it wants. This is what the split looks like and why it doesn’t fall over when the GPU box is asleep.

The constraint that drove the design

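Two machines I already had:

- the N100: low-power, always on
- an old GPU box with a 1060 (6GB VRAM): fast at inference, but only powered on when there is work for it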

Immich, out of the box, expects to run as a single stack on one host. The N100 can do the server, database, web UI, and most jobs at acceptable speeds. What it absolutely cannot do in human time is face detection and CLIP-based smart search on a backlog of 85,000 photos — that’s a ~2-day CPU job, and the box would melt itself.

The GPU box can. In about three hours. The trick is letting it.
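To put those two estimates side by side, here is the back-of-envelope arithmetic as a small sketch. The only inputs are the figures quoted above (85,000 photos, roughly 48 hours on CPU, roughly 3 hours on the GPU); the function name is illustrative, and the rates are averages, not measurements.

```python
PHOTOS = 85_000


def photos_per_second(photos: int, hours: float) -> float:
    """Average throughput needed to clear `photos` in `hours`."""
    return photos / (hours * 3600)


cpu_rate = photos_per_second(PHOTOS, 48)  # ~2 days on the N100
gpu_rate = photos_per_second(PHOTOS, 3)   # ~3 hours on the GPU box
speedup = gpu_rate / cpu_rate             # 48 h / 3 h = 16x
```

That works out to roughly half a photo per second on the N100 against nearly eight per second on the GPU box, which is the gap the rest of the design exists to exploit.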

The shape of the split

┌────────────────────────────────────────┐    ┌────────────────────────────────┐
│  N100 — always on                      │    │  GPU box — on-demand           │
│                                        │    │                                │
│  immich_server (:2283)                 │    │  immich_ml (:3003)             │
│  ├── immich_postgres                   │    │    └── CUDA inference          │
│  ├── immich_redis                      │    │    └── model cache             │
│  └── immich_ml_proxy (nginx) ──────────┼───▶│                                │
│                                        │    │  not managed by Komodo         │
│  Komodo-managed stack "immich"         │    │  manual `docker compose up -d` │
│                                        │    │  when ML work is queued        │
└────────────────────────────────────────┘    └────────────────────────────────┘

   Always-on parts:                              The GPU only wakes when:
   - serving the UI                              - I have a backlog
   - taking uploads from phones                  - I run `docker compose up -d`
   - storing the database                        - ML jobs are queued
   - running lightweight jobs                    - I want face recognition to
   - queueing ML work                              actually finish this week

The Immich server (on the N100) thinks it’s talking to a normal machine-learning endpoint. The endpoint URL is http://immich_ml_proxy:80. There is no machine learning happening at immich_ml_proxy. It’s a tiny nginx container on the N100 that forwards to the GPU box.

The nginx shim, in twenty lines

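# The real ML container on the GPU box. Note: nginx ignores max_fails and
# fail_timeout when the group has a single server, so with only gpu-box
# listed every request still attempts a connection.
upstream immich_ml {
    server gpu-box:3003 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    proxy_connect_timeout 3s;       # fail fast when the box is off
    proxy_read_timeout    300s;     # but be patient once connected
    proxy_send_timeout    300s;

    location / {
        proxy_pass http://immich_ml;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        # Retry conditions; these only matter once a second upstream exists.
        proxy_next_upstream error timeout http_502 http_503;
    }
}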

Three knobs do the work:

The result, behaviourally:

GPU box state           What Immich sees
----------------------  -------------------------------------------------
On, healthy             ML jobs return in normal time
On, but OOM-restarting  Jobs queue, retry, eventually succeed
Off                     Each ML job fails in ~3 seconds, queues for retry
Coming back online      Within a minute, queue starts draining

The seam is invisible to me as a user. The phone uploads a photo; if I notice that smart search doesn’t have it yet, I SSH into the GPU box, run `docker compose up -d`, and within an hour everything is processed. If I forget, the next time I happen to start the GPU box for any reason, it catches up on its own.
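If you want to know whether the GPU box is up before reaching for SSH, the shim's fail-fast behaviour can be mirrored from the N100 side. This is a minimal sketch, not part of my setup: it assumes the box resolves as `gpu-box` and listens on 3003 as in the nginx config, the name `ml_host_reachable` is made up for illustration, and it only checks that the TCP port accepts a connection, not that any model is loaded.

```python
import socket


def ml_host_reachable(host: str = "gpu-box", port: int = 3003,
                      timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    The 3-second default mirrors proxy_connect_timeout in the nginx shim.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hostnames.
        return False
```

Calling `ml_host_reachable()` returns within about three seconds either way, which is the same bound each failing ML job sees when the box is off.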

Job concurrency, calibrated to the boxes that exist

The other half of the split is in Immich’s job-queue settings — adjusted so the N100 doesn’t melt and the GPU box gets the right work:

Job                   Concurrency  Where it runs        Notes
--------------------  -----------  -------------------  -----------------------------------------------
Thumbnail Generation  2            N100                 CPU-bound, kept low so the box stays responsive
Metadata Extraction   3            N100                 Lightweight, can spike higher
Video Conversion      0            N100                 Paused; will re-enable once QuickSync hwaccel is wired up
Smart Search          3            GPU box (via proxy)  Each job uses ~1.2GB VRAM
Face Detection        3            GPU box (via proxy)  Pair with Smart Search
Facial Recognition    1            N100                 CPU clustering, not an ML job

Two settings earned their place painfully: one from the table above, and one on the GPU box itself.

Video Conversion = 0. First import, I left it on default. The N100 tried to transcode every video on the NAS. The CPU pinned at 100% for hours. The web UI became unreachable. The fix is hardware-accelerated transcoding via Intel QuickSync (/dev/dri passthrough), which I haven’t wired up yet, so for now: zero, paused, no transcoding.

ML workers = 1 on the GPU box. I tried 2 workers because “more is better.” With all three models loaded (smart search, face detection, OCR), 2 workers OOMed the 1060’s 6GB VRAM. The model TTL got tuned down to 300s so models unload when idle, but workers stay at 1.

The thing that took me two weekends

The hardest part wasn’t the architecture. The hardest part was the model selection on a 6GB GPU.

I wanted multilingual CLIP. The natural choice was XLM-RoBERTa-Large for smart search. Loaded alongside buffalo_l (face detection) and PP-OCRv5 (OCR), it OOMed the GPU. I tried various things:

attempt 1: XLM-RoBERTa-Large + buffalo_l + PP-OCRv5  → OOM
attempt 2: XLM-RoBERTa-Base  + buffalo_l + PP-OCRv5  → unstable
attempt 3: ViT-B-32__openai  + buffalo_l + PP-OCRv5  → works
attempt 4: same as 3, OCR disabled                    → comfortable headroom

I ended up shipping (3) — ViT-B-32__openai for smart search (English-only, fine for my photos), keeping OCR on. It’s not the model I wanted; it’s the model the hardware allowed. The lesson is the standard one: on constrained hardware, pick the workload before you pick the model.

What I’d build differently

A wake-on-LAN integration. Right now I manually start the GPU box when I want ML to run. A WoL sender on the N100 would let Immich keep queueing jobs and wake the GPU box once there’s enough backlog to justify it. About an evening of work; haven’t done it.
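For the record, the wake-up half is small. The sketch below is what I would start from, not something I run: the MAC address, the backlog threshold of 200, and the function names are all placeholders, and how to read the queue depth out of Immich is left open. The magic packet format itself is standard: six 0xFF bytes followed by the target MAC repeated sixteen times, sent as a UDP broadcast.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255",
                      port: int = 9) -> None:
    """Broadcast the magic packet on the local network (UDP port 9 by convention)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


def should_wake(queued_ml_jobs: int, threshold: int = 200) -> bool:
    """Hypothetical policy: wake the GPU box only once the backlog is worth it."""
    return queued_ml_jobs >= threshold
```

A cron job on the N100 could call `should_wake(...)` with the current queue depth and then `send_magic_packet("aa:bb:cc:dd:ee:ff")` with the GPU box's real MAC. One caveat: a broadcast sent from inside a container on a Docker bridge network will probably not reach the LAN, so this likely wants to run on the host or with host networking.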

Health-aware Immich UI. Immich doesn’t surface “the ML host is currently unreachable” in the UI — it just reports jobs as “queued.” I’d like a small badge somewhere that said “GPU host: offline; ML jobs paused.” Open issue, not on me to write.

OIDC via Authentik. I have Authentik. Immich supports OIDC. I haven’t wired it up because the family logs in from their phones and I haven’t faced the dropdown UX yet. On the list.

Why this is worth doing

For 85,000 photos and a 6GB GPU that’s only on a few hours a day, the split:

I’d run this same pattern for any home-scale ML workload where the work is bursty and expensive. The shape is general: “always-on cheap host serves the steady state; on-demand expensive host does the bursts; a shim between them hides the seam.” Immich just happens to be the first workload I run it for.

