Making a local Whisper server pretend to be OpenAI

I run Whisper on a homelab box for transcription. Plenty of tools (Open WebUI, LM Studio, half the scripts I write) expect to talk to OpenAI’s audio API. Those two facts don’t agree out of the box. This is the FastAPI shim that makes a local Whisper server look exactly like OpenAI’s /v1/audio/transcriptions to anything pointed at it — and the small handful of compatibility traps I hit on the way there.

Why bother

OpenAI’s audio API is a de-facto standard. Tools that need transcription assume that shape: a POST /v1/audio/transcriptions, a Bearer token, a multipart upload, a JSON response with a "text" field. If your tool of choice supports “OpenAI-compatible audio endpoint” and you have a perfectly fine local Whisper running on port 8800 — the gap between those two facts is about fifty lines of Python.

I’d been running a homegrown FastAPI wrapper around whisper-large-v3 that exposed POST /transcribe/ and returned {"transcription": "...", "chunks": [...]}. Structurally similar to OpenAI. Not identical. Open WebUI sneered at it. So I made the local server lie about its identity.

The two-API gap

What OpenAI expects:

POST  https://api.openai.com/v1/audio/transcriptions
Auth: Authorization: Bearer <key>
Body: multipart/form-data with `file` and `model` (e.g. "whisper-1")
Resp: { "text": "...transcribed..." }

What my homegrown server was returning:

POST  http://localhost:8800/transcribe/
Auth: (none)
Body: multipart/form-data with `file`
Resp: { "transcription": "...", "chunks": [...timestamped segments...] }

Five mismatches: endpoint path, optional Authorization header, model form field, response key name, no response_format parameter. Each one is trivial. Together they’re enough to make Open WebUI refuse to talk to me.

The shim, in one file

from fastapi import FastAPI, File, UploadFile, HTTPException, Form
from fastapi.responses import PlainTextResponse
from transformers import pipeline
import tempfile, os

app = FastAPI(title="OpenAI-compatible Whisper API")

# Load model once at startup
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    return_timestamps=True,
    device="cuda:0",   # or "cpu"
)

@app.post("/v1/audio/transcriptions")
async def transcribe(
    file: UploadFile = File(...),
    model: str = Form("whisper-1"),
    language: str | None = Form(None),
    response_format: str = Form("json"),
    temperature: float = Form(0.0),  # accepted for API parity; not passed to the pipeline
):
    if model not in ("whisper-1", "whisper-large-v3"):
        raise HTTPException(400, "Only whisper-1 / whisper-large-v3 supported")

    suffix = os.path.splitext(file.filename or "audio.wav")[1] or ".wav"
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    try:
        # Blocking call: inference runs on the event loop, so requests are handled one at a time.
        result = pipe(tmp_path, generate_kwargs={"language": language} if language else {})
        text = result["text"]

        if response_format == "text":
            return PlainTextResponse(text)
        if response_format == "verbose_json":
            return {
                "text": text,
                "segments": result.get("chunks", []),
            }
        return {"text": text}
    finally:
        os.unlink(tmp_path)

Three knobs worth their place in the function signature:

model — must be accepted because OpenAI clients send it. I validate against a small allow-list to avoid silently transcribing with the wrong model if someone names a whisper-tiny they don’t have.
language — Whisper auto-detects, but clients sometimes know better. Honour it when sent; ignore when not.
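response_format — three values: json (default, just {"text": "..."}), text (plain text, no JSON), verbose_json (text plus timestamps). Anything else, including the srt and vtt formats OpenAI also defines, silently falls through to the json shape. That’s enough to satisfy every client I’ve thrown at it.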
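One caveat on verbose_json: the shim hands back the pipeline’s chunks untouched, and those are shaped like {"text": ..., "timestamp": (start, end)}, while OpenAI’s segments carry separate start, end and text fields (plus several others). None of my clients read the segments, so I never needed to convert them. If one of yours does, a reshape along these lines might be enough. This is a sketch: chunks_to_segments is a name I made up, and it only fills a subset of the fields OpenAI returns.

```python
from typing import Any


def chunks_to_segments(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Reshape pipeline chunks into OpenAI-style segment dicts (subset of fields)."""
    segments = []
    for index, chunk in enumerate(chunks):
        # A chunk without a timestamp is treated as having unknown bounds.
        start, end = chunk.get("timestamp") or (None, None)
        segments.append(
            {
                "id": index,
                "start": start,
                "end": end,  # can be None when the last chunk has no closing timestamp
                "text": chunk.get("text", ""),
            }
        )
    return segments


example = chunks_to_segments(
    [
        {"timestamp": (0.0, 2.5), "text": " First sentence."},
        {"timestamp": (2.5, None), "text": " Second sentence."},
    ]
)
```

In the shim this would replace result.get("chunks", []) in the verbose_json branch.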

The traps, in order of how much they cost me

Trap 1: form fields vs JSON body. OpenAI clients send form fields, not JSON. If your route signature has a bare model: str, FastAPI treats it as a query parameter and never looks at the form field the client sent. With no default, every request is rejected with a 422 about a missing query parameter; with a default, you silently get the default instead of what the client asked for. The fix is model: str = Form("whisper-1"), an explicit Form() annotation (which also needs the python-multipart package installed). The error message when you get this wrong is unhelpful in a way I’d describe as personal.
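It helped me to look at what actually goes over the wire. The sketch below builds a multipart/form-data body by hand with nothing but the standard library, next to the JSON body I had wrongly been picturing. build_multipart is a hypothetical helper written for illustration; real clients use their HTTP library’s own encoder.

```python
import json
import uuid


def build_multipart(fields: dict[str, str], filename: str, payload: bytes) -> tuple[str, bytes]:
    """Return (content_type, body) for a multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    lines: list[bytes] = []
    # Each plain form field becomes its own part, named by Content-Disposition.
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode(),
        ]
    # The audio goes in a part called "file", with the original filename attached.
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        payload,
        f"--{boundary}--".encode(),
        b"",
    ]
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(lines)


content_type, body = build_multipart({"model": "whisper-1"}, "voice.m4a", b"\x00\x01")

# What a JSON client would send instead; the shim's Form() fields never see this.
json_body = json.dumps({"model": "whisper-1"}).encode()
```

The model value lives inside a named part of the body, not in the URL and not in a JSON object, which is why the parameter has to be declared with Form().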

Trap 2: response key name. Mine was transcription. OpenAI’s is text. Open WebUI checks for text specifically; the missing key produces an empty transcript in the UI with no error. I lost forty minutes to “the API works in curl but the UI is blank” before I tailed the right log.

Trap 3: the empty-bearer case. Some clients send Authorization: Bearer with an empty token if no API key is configured. FastAPI’s auth dependencies don’t love this — they treat it as a malformed header. The shim has no auth requirement at all (it’s bound to localhost), so I just ignore the header. If you’re exposing this past localhost, add real auth before you do — see below.
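If you would rather normalise the header than ignore it, a tolerant parser is only a few lines. This is a sketch, not something the shim above contains: bearer_token is a hypothetical helper that treats a missing header, a bare "Bearer" and an empty token all as "no token supplied".

```python
from typing import Optional


def bearer_token(header: Optional[str]) -> Optional[str]:
    """Return the token from an Authorization header, or None if absent or empty."""
    if not header:
        return None
    # Split the scheme from whatever follows the first space.
    scheme, _, credentials = header.strip().partition(" ")
    if scheme.lower() != "bearer":
        return None
    token = credentials.strip()
    return token or None
```

With that in place, bearer_token("Bearer ") and bearer_token(None) both come back as None, so the empty-bearer client looks the same as one that sent no header at all.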

Trap 4: file extension matters more than it should. Whisper’s pipeline routes through ffmpeg, and ffmpeg figures out the codec from the file extension. If you save an uploaded .m4a to a tempfile with no extension, ffmpeg shrugs and gives you silence. Preserve the extension from the upload.
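The shim takes whatever extension the client supplies. A slightly more defensive variant, sketched below, only keeps extensions from a known list and falls back to .wav otherwise. Both the helper name upload_suffix and the contents of KNOWN_SUFFIXES are my own assumptions; adjust the set to whatever your ffmpeg build handles.

```python
import os

# Assumed allow-list, not an exhaustive list of what ffmpeg can decode.
KNOWN_SUFFIXES = {".wav", ".mp3", ".m4a", ".mp4", ".ogg", ".flac", ".webm"}


def upload_suffix(filename, default=".wav"):
    """Pick a tempfile suffix from a client-supplied filename."""
    # basename() drops any directory components included in the filename.
    name = os.path.basename(filename or "")
    suffix = os.path.splitext(name)[1].lower()
    return suffix if suffix in KNOWN_SUFFIXES else default
```

In the shim this would stand in for the os.path.splitext(...) line, for example suffix = upload_suffix(file.filename). The fallback to .wav is a guess about the content, so an unknown format may still fail to decode.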

How clients see it now

The same curl that talks to OpenAI talks to my server with one URL change:

# OpenAI
curl -X POST https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F "file=@voice.m4a" -F "model=whisper-1"

# Local
curl -X POST http://localhost:8800/v1/audio/transcriptions \
  -F "file=@voice.m4a" -F "model=whisper-1"

Open WebUI: point its “Audio” connector at http://localhost:8800/v1 and it just works. LM Studio: same. Python scripts using the openai SDK: set OPENAI_BASE_URL to http://localhost:8800/v1, set OPENAI_API_KEY to any placeholder value (the SDK refuses to build a client without one, and the shim ignores it), and the standard openai.audio.transcriptions.create(...) call works unchanged. That last one is the test that made me sure the shim was right: the official OpenAI Python SDK doesn’t notice the difference.

If you’re exposing this past localhost

Don’t, without auth. The full ffmpeg-into-Whisper pipeline will happily accept any audio anyone wants to send, and Whisper is GPU-heavy. An open transcription endpoint on the internet is a thing other people will use until your GPU is unusable.

The simplest passable auth is a header check:

from fastapi import Request
from fastapi.responses import JSONResponse  # needed for the 401 response below

API_TOKEN = os.environ["LOCAL_WHISPER_TOKEN"]

@app.middleware("http")
async def require_token(request: Request, call_next):
    if request.url.path.startswith("/v1/"):
        header = request.headers.get("authorization", "")
        if header != f"Bearer {API_TOKEN}":
            return JSONResponse({"error": "unauthorized"}, status_code=401)
    return await call_next(request)
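A plain != comparison can, in principle, leak how much of the token matched through timing. For a homelab that is mostly a theoretical concern, but the standard library makes the careful version cheap. The sketch below uses hmac.compare_digest; token_matches is a hypothetical helper name, not part of the middleware above.

```python
import hmac


def token_matches(header: str, expected_token: str) -> bool:
    """Compare an Authorization header to the expected bearer value in constant time."""
    expected_header = f"Bearer {expected_token}"
    # compare_digest takes time independent of where the two values first differ.
    return hmac.compare_digest(header.encode("utf-8"), expected_header.encode("utf-8"))
```

In the middleware, the check would become if not token_matches(header, API_TOKEN). It is also worth refusing to start if LOCAL_WHISPER_TOKEN is set to an empty string, since an empty expected token would match the empty-bearer header from Trap 3.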

Better still: keep the shim on localhost and put it behind something that already does auth (Authentik, a Cloudflare Access policy, whatever you already trust). I run mine behind a Cloudflare Tunnel with an Access policy that lets exactly one device through. Less code, more sleep.

What this is, in fifty lines

It’s a translation layer, nothing more. The model is the same whisper-large-v3 the rest of my stack was already using. The pipeline is the same. The only thing that changed is the contract — the endpoint path, the field names, the optional parameters, the response shape — to look exactly like the one that already won.

The lesson, broadly: when a tool you like assumes an API shape, don’t argue with the tool. Speak its dialect, especially when the cost of speaking its dialect is a handful of fields and a response_format switch. You’ll spend the time you save on things that aren’t string-renaming.