Summarising thirty minutes of audio on a phone

I’ve been building an Android meeting recorder that runs the whole pipeline locally: transcription, diarisation, and summarisation, with no audio or text ever leaving the phone. The hard part isn’t recording. The hard part is summarising thirty-plus minutes of conversation with a 1B-class LLM that can hold roughly 4,000 tokens at a time. This post covers the five-stage map-reduce pipeline that made it work, the prompts that compress aggressively without lying, and the thing that broke at chunk 14.

The naive version and why it dies

The instinct is correct: transcribe → summarise. Two stages. It works beautifully for two-minute voice memos.

It dies the moment the transcript exceeds the model’s context window. On a 1B-3B model that’s around 4,000 tokens — roughly six minutes of conversational speech. Past that, you have three bad options and one good one:

Approach	What happens
Truncate to last 4k tokens	The summary describes the last six minutes of a forty-minute meeting
Sliding-window summarise	The “summary” is now a summary of summaries of summaries; quality decays with each pass
Pay for a long-context model	Not running on a phone; defeats the entire premise
Chunk → summarise each → combine	Map-reduce. This is the one.

The good option, written out properly, has five stages. None of them is exotic. Getting them to cooperate on a phone is the work.

The five stages, in one picture

┌──────────────────────────────────────────────────────────────────┐
│  STAGE 1: TRANSCRIPTION                                          │
│  whisper.cpp on 30-second chunks with 5s overlap                 │
│  → segments[] {text, start, end, speaker_embedding}              │
└──────────────────────────────────────────────────────────────────┘
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  STAGE 2: DIARISATION + LOGICAL CHUNKING                         │
│  cluster speaker_embeddings; merge same-speaker runs;            │
│  split on (speaker change) OR (pause > 2s) OR (>2500 tokens)     │
│  → LogicalChunk[] {speakers, text, token_count, timestamps}      │
└──────────────────────────────────────────────────────────────────┘
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  STAGE 3: MAP — summarise each chunk in parallel                 │
│  ONNX-runtime 1B LLM, 300-token output budget per chunk          │
│  → ChunkSummary[] {summary, key_points[], actions[]}             │
└──────────────────────────────────────────────────────────────────┘
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  STAGE 4: REDUCE — synthesise summaries                          │
│  if total < 3k tokens: single pass                               │
│  else: hierarchical reduce in groups of 3-4                      │
│  → FinalSummary {text, follow_ups[], action_items[]}             │
└──────────────────────────────────────────────────────────────────┘
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  STAGE 5: POST-PROCESS — restore timestamps, format output       │
└──────────────────────────────────────────────────────────────────┘

Each box does one job. Each box also fails in its own way, distinct from the boxes around it, so a problem gets diagnosed at the stage where it happens instead of being written off as “the model is bad.”

Stage 1: transcription, the cheap part

whisper.cpp runs on Android via the NDK. I use whisper-small (~470MB on disk, ~250ms latency per 30-second chunk on a Pixel 7a), with 30-second windows and a 5-second overlap to make sure no word gets cut in half across chunk boundaries. The output isn’t text-only; the speaker embedding from each segment travels with it into Stage 2.
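The windowing arithmetic is easy to get off by one, so here is a minimal sketch of it in Python. The function name `window_bounds` and its parameters are illustrative, not part of whisper.cpp; the only values taken from the text are the 30-second window and the 5-second overlap.

```python
def window_bounds(total_s, window_s=30.0, overlap_s=5.0):
    """Return (start, end) times in seconds for overlapping transcription windows."""
    if overlap_s >= window_s:
        raise ValueError("overlap must be smaller than the window")
    stride = window_s - overlap_s  # with the defaults, a new window starts every 25 s
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)  # the last window is clipped to the audio length
        bounds.append((start, end))
        if end >= total_s:
            break
        start += stride
    return bounds
```

With these defaults, 70 seconds of audio yields three windows (0-30, 25-55, 50-70), and a 30-minute recording yields 72.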

The one tip I’d give: stream Stage 1 output into Stage 2 as soon as each chunk is ready. Don’t wait for the entire transcription to finish. Stage 2 needs some initial audio before it can cluster speakers confidently, but everything after the first 60 seconds is steady-state: you should be running Stage 2 (and Stage 3) on the early chunks while Stage 1 is still chewing through the later ones.
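One way to picture the streaming hand-off is a bounded producer/consumer queue. This is a sketch in Python rather than the app's actual Android code; `transcribe`, `consume`, `run_pipeline` and the placeholder segment dicts are all hypothetical stand-ins for the real stages.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def transcribe(windows, out_q):
    """Stage 1 stand-in: emit one segment per audio window as soon as it is ready."""
    for i, window in enumerate(windows):
        # Placeholder for the real transcription result of this window.
        out_q.put({"id": i, "text": f"transcript of {window}"})
    out_q.put(_DONE)


def consume(in_q, handle):
    """Stage 2 stand-in: process segments while Stage 1 is still producing them."""
    while True:
        segment = in_q.get()
        if segment is _DONE:
            break
        handle(segment)


def run_pipeline(windows, handle, max_pending=4):
    # A bounded queue applies back-pressure if the consumer falls behind.
    q = queue.Queue(maxsize=max_pending)
    producer = threading.Thread(target=transcribe, args=(windows, q))
    producer.start()
    consume(q, handle)
    producer.join()
```

The bounded queue is the design choice worth copying: it keeps the number of in-flight segments small, which matters more on a phone than raw throughput.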

Stage 2: chunking that respects the conversation

This is the stage that does the actual work of fitting human conversation into model-sized boxes.

Three split signals, evaluated in order:

1. Speaker change. If the speaker just changed and the current chunk is non-trivial (> 200 tokens), close the chunk.
2. Long pause. If there’s a >2-second gap in the timestamps, close the chunk.
3. Hard token cap. If the chunk has hit 2500 tokens regardless of the above, close it anyway.
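The three rules can be sketched as a small chunker. This is an illustration in Python under stated assumptions, not the app's implementation: `Segment`, `Chunk`, `should_close` and `chunk_segments` are hypothetical names, token counts are approximated by whitespace-separated words, and the hard cap is applied just before a segment would push the chunk over the limit.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    speaker: str
    text: str
    start_ms: int
    end_ms: int


@dataclass
class Chunk:
    segments: list = field(default_factory=list)
    token_count: int = 0


def count_tokens(text):
    # Assumption: whitespace words stand in for real tokenizer counts.
    return len(text.split())


def should_close(chunk, nxt, min_tokens=200, max_pause_ms=2000, max_tokens=2500):
    last = chunk.segments[-1]
    if nxt.speaker != last.speaker and chunk.token_count > min_tokens:
        return True  # 1. speaker change on a non-trivial chunk
    if nxt.start_ms - last.end_ms > max_pause_ms:
        return True  # 2. long pause
    if chunk.token_count + count_tokens(nxt.text) > max_tokens:
        return True  # 3. hard token cap
    return False


def chunk_segments(segments, **limits):
    chunks, current = [], Chunk()
    for seg in segments:
        if current.segments and should_close(current, seg, **limits):
            chunks.append(current)
            current = Chunk()
        current.segments.append(seg)
        current.token_count += count_tokens(seg.text)
    if current.segments:
        chunks.append(current)
    return chunks
```

A single segment longer than the cap would still become its own oversized chunk in this sketch; splitting inside a segment is left out.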

The token cap sits below the model’s 4k limit on purpose. The prompt and instructions cost tokens too, and 2500 in / 300 out leaves roughly 1,200 tokens for them. The cap itself exists for the inevitable case where someone delivers one extremely long monologue with no pauses, which happened in about 8% of meetings in my testing.

A LogicalChunk carries everything downstream needs to put the summary back together:

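// One model-sized slice of the conversation, produced by Stage 2.
data class LogicalChunk(
    val id: Int,                 // chunk identifier, used to match summaries back in Stage 5
    val text: String,            // merged transcript text for this chunk
    val speakers: List<String>,  // speaker labels present in the chunk
    val startMs: Long,           // chunk start time in milliseconds
    val endMs: Long,             // chunk end time in milliseconds
    val tokenCount: Int          // token count of text, at most the 2500 cap
)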

Stage 3: the map step, compressed mercilessly

Each chunk goes to the LLM with a hostile instruction:

<type>{template_type}</type>
<speakers>{speaker_list}</speakers>
<transcript>{chunk_text}</transcript>

Extract in max 80 words:
- SUMMARY: core content
- POINTS: max 3 key points (5 words each)
- ACTIONS: any action items mentioned
- DECISIONS: any decisions made

Be factual. No fluff.

The 80-word cap exists because 1B-class models will, given an inch, fill the entire 300-token output budget with prose. The structured format (SUMMARY / POINTS / ACTIONS / DECISIONS) makes the output parseable and gives the model a job it can actually do: extraction is much easier for a small model than freeform summarisation.
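To show what "parseable" buys you, here is a minimal parser sketch in Python for the four labelled fields. The function name `parse_chunk_summary` and the returned dictionary keys are assumptions for illustration; the field labels come from the prompt above. It tolerates bullet markers and mixed case, since small models do not format consistently.

```python
import re

FIELDS = ("SUMMARY", "POINTS", "ACTIONS", "DECISIONS")
_HEADER = re.compile(
    r"^\s*[-*]?\s*(SUMMARY|POINTS|ACTIONS|DECISIONS)\s*:\s*(.*)$", re.IGNORECASE
)


def parse_chunk_summary(raw):
    """Split model output into the four labelled fields; missing fields stay empty."""
    parsed = {name: [] for name in FIELDS}
    current = None
    for line in raw.splitlines():
        match = _HEADER.match(line)
        if match:
            current = match.group(1).upper()
            line = match.group(2)  # text on the same line as the label, if any
        item = line.strip().lstrip("-*• ").strip()
        if current and item and item.lower() != "none":
            parsed[current].append(item)
    return {
        "summary": " ".join(parsed["SUMMARY"]),
        "key_points": parsed["POINTS"][:3],  # the prompt asks for at most three
        "actions": parsed["ACTIONS"],
        "decisions": parsed["DECISIONS"],
    }
```

Text that appears before the first label is dropped, which conveniently discards any preamble the model adds before getting to the point.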

The Map step is embarrassingly parallel in principle. In practice it runs serially, with one consolation:

One ONNX session in memory, used sequentially for every chunk. Loading two sessions on a phone means swapping to disk, which is slower than running them serially.
The Reduce step can start as soon as enough chunk summaries are ready, so you don’t have to wait for the full Map pass.

Realistic timing on a Pixel 7a, with Qwen2.5-1.5B-Instruct quantised to Q4_K_M: ~2 seconds per chunk. For a 30-minute meeting that produces ~14 chunks, the Map step takes about 28 seconds wall-clock.

Stage 4: reduce, with a branching strategy

The Reduce step combines the chunk summaries into a final summary. There’s a single-shot path and a hierarchical path, and the choice between them is mechanical:

# Single pass if everything fits in one prompt; otherwise reduce in groups first.
total_summary_tokens = sum(s.token_count for s in chunk_summaries)

if total_summary_tokens < 3000:
    final = llm.summarise(concat(chunk_summaries), template)
else:
    grouped = group_in_threes(chunk_summaries)
    intermediate = [llm.summarise(concat(g), template) for g in grouped]
    final = llm.summarise(concat(intermediate), template)

3000 tokens is the threshold I landed on for a 4k-context model. Below that, a single reduce pass produces noticeably better output than the hierarchical path. Above that, a single pass starts to confuse the model: earlier sections get less weight than later ones, and action items from the start of the meeting quietly disappear.
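A runnable version of the branching logic might look like the sketch below. The names are assumptions: `summarise` is a hypothetical callable wrapping the LLM (text in, shorter text out), and tokens are approximated by whitespace words. One deliberate difference from the description above: this sketch repeats the grouping step until the total fits under the threshold, rather than assuming one intermediate level is always enough.

```python
def count_tokens(text):
    # Assumption: whitespace words approximate tokenizer counts.
    return len(text.split())


def group(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def reduce_summaries(summaries, summarise, budget=3000, group_size=3):
    """Collapse chunk summaries into one, going hierarchical only when needed."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each level shrinks")
    level = list(summaries)
    # Hierarchical path: summarise small groups until the total fits in one prompt.
    while len(level) > 1 and sum(count_tokens(s) for s in level) >= budget:
        level = [summarise("\n\n".join(g)) for g in group(level, group_size)]
    # Single-shot path, or the final pass over the intermediate summaries.
    return summarise("\n\n".join(level))
```

Because each pass divides the number of summaries by the group size, the loop always terminates, even if the model fails to compress as much as asked.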

The Reduce prompt is gentler than the Map prompt because the input is already structured and the model doesn’t need bullying:

<type>{template_type}</type>
<segments>
{concatenated_chunk_summaries}
</segments>

Create final {template_type} summary:
1. Synthesize segments chronologically
2. Deduplicate action items
3. Generate 2-3 specific follow-ups

Stage 5: timestamps, restored

The chunk summaries lose timestamp precision (deliberately — the model doesn’t need to know it’s reading the 18:30 mark). Stage 5 re-attaches [start–end] ranges to the action items and key points by matching them back to the LogicalChunk they came from. The final output is markdown with clickable timestamps in the UI.
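Re-attachment is a lookup plus some formatting. The sketch below is in Python and assumes each action item or key point still carries the id of the chunk it was extracted from; `format_ms`, `attach_timestamps` and the sample item text are illustrative.

```python
def format_ms(ms):
    """Render a millisecond offset as mm:ss."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes:02d}:{seconds:02d}"


def attach_timestamps(items, chunks_by_id):
    """Prefix each item with the time range of the chunk it came from.

    items: (chunk_id, text) pairs.
    chunks_by_id: chunk id -> (start_ms, end_ms).
    """
    stamped = []
    for chunk_id, text in items:
        start_ms, end_ms = chunks_by_id[chunk_id]
        stamped.append(f"[{format_ms(start_ms)}-{format_ms(end_ms)}] {text}")
    return stamped
```

For example, an item from a chunk spanning 1,110,000 to 1,175,000 ms comes back prefixed with `[18:30-19:35]`.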

What broke at chunk 14

Two stories that aren’t in the design docs.

The OOM at chunk 14. The first version of the Map step held all chunk summaries in a single ArrayList, then handed the concatenation to the Reduce step. On a 40-minute meeting (~18 chunks), the phone ran out of memory somewhere around chunk 14, when the Map step’s intermediate buffers and the live ONNX session collided. The fix was so dumb that I should have seen it coming: stream the chunk summaries to disk as they’re produced, hold only an index in memory, and read them back when Reduce starts.
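The spill-to-disk fix can be sketched as an append-only file plus a list of byte offsets. This is a Python illustration, not the Android code: `SummarySpool` is a hypothetical name, JSON Lines is an assumed storage format, and on a phone the path would point at the app's cache directory rather than a temp file.

```python
import json
import os
import tempfile


class SummarySpool:
    """Append chunk summaries to a JSON Lines file; keep only byte offsets in memory."""

    def __init__(self, path=None):
        if path is None:
            fd, path = tempfile.mkstemp(suffix=".jsonl")
            os.close(fd)
        self.path = path
        self.offsets = []  # one integer per summary: where its line starts
        self._size = 0
        open(path, "wb").close()  # start from an empty file

    def append(self, summary):
        line = (json.dumps(summary) + "\n").encode("utf-8")
        with open(self.path, "ab") as f:
            f.write(line)
        self.offsets.append(self._size)
        self._size += len(line)

    def __iter__(self):
        # Read summaries back one at a time, in the order they were written.
        with open(self.path, "rb") as f:
            for offset in self.offsets:
                f.seek(offset)
                yield json.loads(f.readline().decode("utf-8"))
```

The Map loop calls `append` after each chunk and drops its reference to the summary; the Reduce step iterates over the spool when it starts.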

The Qwen3 1B Thinking model wasted half a minute per chunk thinking out loud. I switched from a non-thinking 1B to Qwen3-1B because Twitter said it was better, and discovered every Map step now spent 30-40 seconds emitting <think>…</think> reasoning before producing the actual 80-word extraction. The model is, in fact, better. It is also useless on a phone with this prompt. Three solutions exist (a /no_think directive in the prompt, a --no-thinking flag at the runtime level, or stripping <think> blocks from a custom Modelfile). I went with the prompt-level fix because it was the one I could do without rebuilding the model, then switched back to the non-thinking 1B for production because even at zero thinking budget Qwen3 was slower.
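The prompt-level fix plus a defensive cleanup can be sketched in a few lines of Python. `/no_think` is the Qwen3 soft switch named above; the helper names `with_no_think` and `strip_think` are illustrative. The cleanup is there because a thinking model may still emit an empty `<think>` block even when asked not to reason.

```python
import re

_THINK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def with_no_think(prompt):
    """Append the soft switch that asks the model to skip its reasoning block."""
    return prompt.rstrip() + " /no_think"


def strip_think(output):
    """Remove any complete <think>...</think> blocks that still come back."""
    return _THINK.sub("", output).strip()
```

This sketch only removes complete blocks. If generation is cut off mid-thought by the output budget, the closing tag never arrives and the text would need separate handling.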

The lesson, written down so I don’t have to learn it twice: the right model for an on-device pipeline is the one that doesn’t try to be clever.

The numbers, end to end

For a 30-minute meeting (~14 chunks), on a Pixel 7a, with whisper-small + Qwen2.5-1.5B-Q4_K_M:

Stage 1 (transcription)    ~9 min wall-clock during the meeting
                           (streaming, not after)
Stage 2 (chunking)         <100ms
Stage 3 (map)              ~28s after meeting ends
Stage 4 (reduce)           ~6s
Stage 5 (post-process)     <50ms

Time-to-summary, end-to-end after meeting ends: ~35 seconds.

Most of the compute is Stage 1, but Stage 1 runs in the background during the meeting, so by the time someone hits “summarise,” there’s effectively only 35 seconds of work left.

What this isn’t

This isn’t going to beat GPT-4 on a long meeting. It’s not trying to. It’s trying to be the good-enough version that runs without internet, without a cloud account, without a subscription, and without the meeting transcript being copied to anyone’s training set.

There are domains where “good enough on a phone, never leaves the phone” is the entire feature. Therapy notes. Legal calls. Customer interviews under NDA. Any conversation where the metadata of who you summarised today matters as much as the content. For those, a 1B-class on-device pipeline isn’t a compromise. It’s the product.