For two weeks I had ScreenSearch capturing what I was doing on my computer. Every night, a different LLM summarised the day’s data into an “intelligence report.” Halfway through, I realised I was running an unintentional experiment: same input, three models. The differences in what each one noticed about my day told me more about each model than any benchmark has. One of them, in particular, kept calling me out in ways I wasn’t ready for.
What I was actually doing
ScreenSearch sat in the tray, watched the screen at 1 frame every 3 seconds, OCR’d anything readable, captured window titles and process names, and dropped everything into a local database. Once a day, a small script would pull the day’s data, format it as a structured prompt, and ask a local LLM to write a paragraph or two about it.
The script was identical for all three models. The prompt was identical. The data was identical. The only thing that changed was the model.
The models, all locally hosted, were:
- Mistral (a 3B for some days, a much larger variant for others)
- Qwen (38B)
- Gemma (312B — yes, that’s the one with the absurd parameter count)
Over twelve days I ran each model on roughly four days of my activity. Then, because I wanted a fair comparison, I asked a fourth model (the assistant I was already using) to read all twelve reports and tell me which one had been the most useful. I didn’t tell it which model had written which.
The verdict was unambiguous. The verdict was also slightly mortifying.
What “the same prompt” looked like
The structured prompt the script sent every day looked like this:
You are reviewing one day of computer activity.
DATA (sorted by start time):
- 06:50-07:18 Vivaldi - "Inbox - nicolas.estrem@gmail.com"
- 07:18-07:42 Cursor.exe - "trackersync — rate-limiter.js"
- 07:42-08:01 Chrome - "Cloudflare Dashboard"
- 08:01-09:30 WindowsTerminal - "wsl"
text: 600 lines of bash session, mostly git + docker
... (50-100 such entries)
WRITE: a one-page intelligence report. Sections:
- executive summary
- key activities by time block
- tools and apps used
- patterns or recommendations
What came back varied wildly.
The Mistral reports kept telling me things I didn’t want to hear
Mistral was the model that, every single time, interpreted the data instead of just listing it.
Sample output, lightly paraphrased to remove personal identifiers:
Executive summary: today’s activity was split between three threads — TrackerSync rate-limiter work in the morning (concentrated, low context-switch), a long midday session jumping between client emails and unrelated documentation tabs (high context-switch, low concentration), and a third evening block where the author was browsing developer YouTube while a code session was active in the background (low intentionality).
Recommendations:
- The midday block consumed three hours and produced two commits. Compare to the morning block: 90 minutes, four commits. The pattern suggests morning is the higher-leverage period and is being underused on most days.
- The evening “background coding while browsing” pattern appeared on 7 of 12 days. Either acknowledge it as social time and stop attempting to code in it, or close the browser tab.
I read that paragraph and felt a small jolt of “how dare you.” I sat with the jolt for a minute. The model was, in fact, not wrong. The model was looking at the same data I had access to and drawing conclusions I had been carefully not drawing. The data said what it said. The interpretation was the part I hadn’t done.
The Mistral reports kept doing this. They:
- Identified the intent behind sequences of actions (“you were clearly building a productivity dashboard for the next two hours”).
- Caught anti-patterns I’d never named (“fragmented session,” “context-switching overhead”).
- One day, it analysed my use of a French culinary-show metaphor inside a business proposal and noted it might not land the way I expected with a non-French reader.
The last one I had to read twice. The model noticed I’d used a metaphor that wouldn’t translate. I’d written that paragraph at midnight, half-asleep, very pleased with myself for the wordplay. The model — looking at the OCR’d text of my own document, after the fact — flagged a tone-mismatch I had not noticed.
This is the moment I started taking the experiment more seriously.
Qwen wrote competent reports
Qwen was the model I had assumed would win. 38B parameters; well-regarded; instruct-tuned; great at coding tasks.
Qwen’s reports were solid. They had clear structure, accurate timelines, clean prose. They captured the right details. They did not interpret. The most “insightful” Qwen report said, more or less:
The author spent the morning on technical deep-dive analysis and the afternoon on productivity-optimisation work. Tools used: Cursor, Chrome, Vivaldi, Windows Terminal, Obsidian. The session reflects a focus on backend development.
That is, technically, what happened. It is also exactly the kind of summary you’d get from a slightly-better-than-template auto-generator. There’s no judgement in it. No “this looks like a habit that’s costing you.” No “this pattern is recurring.” Just a clean restatement of the inputs, in better prose.
I asked Qwen, in a follow-up, to be more critical. It cheerfully produced a more critical report. The criticisms were generic — “consider time-blocking your work,” “reduce context-switching by batching similar tasks.” Both true. Neither personal. Neither connected to my actual data.
Gemma was Qwen with extra adjectives
Gemma (312B; the parameter count was supposed to mean something) produced reports that were structurally identical to Qwen’s. Same headings, same depth, same kind of generic recommendations. Slightly more flowery prose. No additional analytical content.
Gemma did mention specific filenames more often — “the author was editing promptfoo.csv and feedback-app/src/index.ts” — which was correct and useless. I already knew which files I’d been editing. What I didn’t know was whether the time I’d spent on them was justified, and Gemma had no opinion on that.
I want to be fair to Gemma: it’s a very capable model in many other ways. For this specific task — looking at activity data and saying something useful about it — it was hard to distinguish from Qwen, and both were hard to distinguish from a competent template.
Why Mistral won, and why that surprised me
The Mistral reports won for one reason: they had opinions. Not aggressive opinions. Not opinions that were always right. But opinions — provisional, well-anchored to the data, willing to be wrong.
Look at the difference in language:
Qwen/Gemma: “The author spent two hours on documentation.”
Mistral: “The author spent two hours on documentation, distributed across nine sessions averaging 13 minutes each. This shape suggests the documentation work was happening in gaps between other work rather than as a focused block, which the author has previously indicated is their preferred mode.”
The Mistral version contains a claim — “this shape suggests” — that’s testable against the data. The Qwen version contains a statement that’s just an aggregate of the data. The Qwen version is safer. The Mistral version is useful.
This was not what I expected. I had assumed bigger parameter counts would produce more thoughtful output. They produced more competent output, which is a different thing. Mistral was producing something the larger models were not even attempting — and I don’t have a clean theory about why. Training data composition, instruction-tuning approach, something about how Mistral was rewarded during RLHF — I don’t know. What I know is that on this specific task, with this specific input shape, the model with opinions outperformed the models with adjectives.
Three things I took away
For tasks where you want analysis, ask the model to be wrong. I started prepending “It is acceptable to be wrong; make a specific claim that the data either supports or refutes” to my report prompts. Qwen and Gemma started producing slightly better output. They still didn’t catch up to Mistral, but the gap closed. The lesson is that models — especially aggressive instruct-tuned ones — default to safe summarisation, and you have to explicitly invite them to make claims that could be wrong, or they won’t.
Activity data is brutal honesty. I had been telling myself a particular story about how I work — “morning focus, afternoon collaboration, evening winding down.” The data said “morning focus, afternoon scatter, evening pretending to code while watching videos.” The data was right. The story was the story I’d been telling because I liked the story. Anyone could have read the activity data and seen the same thing. I’d been the only one with access, and I’d been the worst-positioned to read it honestly.
Local models are good enough for this kind of work. Nothing in this experiment touched a hosted API. The activity data is exactly the kind of thing I do not want sent anywhere — what I’m working on, who I’m emailing, what tabs I have open. The fact that a local Mistral could produce useful interpretive output from this data, on my hardware, without any of it leaving the machine, is the entire point of the privacy-preserving local-LLM stack. The models don’t need to be the best. They need to be honest with the user, which the local ones, in my experience, more often are.
What I’m doing differently now
The intelligence reports still run, daily, on a local Mistral. I don’t always read them. When I do, I read them with the expectation that they will occasionally say something that mildly offends me, and that the offending bit is usually the part I most need to read.
I’ve stopped pretending the evening sessions are productive. They’re social-shaped, and the data made that clear.
I’ve moved a chunk of my “documentation” work to the morning block, where it apparently goes faster.
I noticed that the model never mentions one particular category of activity (let’s call it “stuff I do when I’m avoiding something I should do”), which suggests either the data is incomplete on that category or the model is being polite. I am content with either explanation. The model can keep that one to itself.
The Mistral that did the heavy lifting was the larger variant, not the 3B. The 3B was about as analytical as Qwen — fine, but template-shaped. The takeaway about “ask the model for opinions” applies to any size; it’s just easier to see the gap at smaller sizes.