On-device therapy note generator for M-series Macs
The question every technical buyer asks first is "how long does one note take on my Mac." Here's the honest table, chip by chip, measured on a real 50-minute session transcript.
TL;DR
TherapyDraft drafts a therapy note entirely on the clinician's Mac — audio in, SOAP/DAP/BIRP/GIRP out, and nothing in the audio-to-draft path ever touches the network. On an M2 with 16 GB of RAM, a 50-minute session transcribes in ~3:40 and drafts the note in ~50 seconds, for about 4.5 minutes wall-clock end to end. An M4 Pro finishes the same workload in under 2 minutes. Install footprint is roughly 9.5 GB on first run (app binary + whisper large-v3 weights + Qwen 2.5 14B in 4-bit). That's the whole picture.
What "on-device" means here
When a tool claims to be on-device, the important question is "which device." TherapyDraft runs the two heaviest parts of the pipeline — speech-to-text and note drafting — on the Apple Silicon GPU of the clinician's own Mac. Nothing in the audio-to-draft path touches a server we control, a server a third party controls, or even the public internet. The only outbound traffic from the app is a Stripe license check at startup and an anonymous version ping, both governed by the macOS network entitlement allow-list documented on the private AI therapy scribe page.
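That entitlement model amounts to an allow-list: every outbound host except the two named endpoints is blocked. A minimal sketch of such a check, with hostnames that are purely illustrative assumptions (not TherapyDraft's actual allow-list):

```python
from urllib.parse import urlparse

# Illustrative allow-list; the real hostnames are assumptions, not documented.
ALLOWED_HOSTS = {
    "api.stripe.com",       # license check at startup (assumed host)
    "ping.example.com",     # anonymous version ping (hypothetical host)
}

def is_outbound_allowed(url: str) -> bool:
    """Permit a request only if its host is on the allow-list."""
    return urlparse(url).hostname in ALLOWED_HOSTS
```

Anything in the transcription or drafting path would fail this check by construction, since neither stage constructs a URL at all.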
Runtime architecture
The pipeline is two stages, both running natively on Apple Silicon:
- Transcription. whisper.cpp running the large-v3 model in a 5-bit Core-ML-accelerated build. On Apple Silicon this is the fastest well-audited speech-to-text available that still ships clinical-quality transcription of soft-spoken, emotionally charged speech.
- Drafting. Qwen 2.5 14B Instruct, quantized to 4-bit, running on the MLX runtime. The draft prompt is a short template plus five of your own example notes (the clinician's voice, not a generic medical tone) and the transcribed session. Output is structured into SOAP, DAP, BIRP, or GIRP as selected.
The model is a separately-downloadable artifact, so if the Qwen licensing ever shifts or a better 14B ships, you swap the weights in the settings panel without reinstalling the app.
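The drafting stage's prompt assembly can be sketched roughly as follows. The function name and template wording are hypothetical; only the inputs (a note format, five example notes, the transcript) come from the description above:

```python
NOTE_FORMATS = {"SOAP", "DAP", "BIRP", "GIRP"}

def build_draft_prompt(note_format: str, example_notes: list[str], transcript: str) -> str:
    """Assemble the drafting prompt: template + five example notes + transcript."""
    if note_format not in NOTE_FORMATS:
        raise ValueError(f"unsupported note format: {note_format}")
    if len(example_notes) != 5:
        raise ValueError("expected exactly five example notes")
    examples = "\n\n".join(
        f"Example note {i + 1}:\n{note}" for i, note in enumerate(example_notes)
    )
    return (
        f"Write a {note_format} note in the style of the example notes below.\n\n"
        f"{examples}\n\n"
        f"Session transcript:\n{transcript}\n\n"
        f"{note_format} note:"
    )
```

The five examples are what anchor the output to the clinician's own voice rather than a generic medical register.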
Latency by chip (50-minute session, 600-token draft)
Measured on clean installs with the default large-v3 transcription and the default 14B drafting model. Timings are wall-clock, from "drag the audio file in" to "draft ready to copy."
| Chip | Unified RAM | Transcribe (50 min audio) | Draft (600 tokens) | Total |
|---|---|---|---|---|
| M1 | 16 GB | ~5:10 | ~1:25 | ~6:35 |
| M2 | 16 GB | ~3:40 | ~0:50 | ~4:30 |
| M2 Pro | 16 GB | ~2:55 | ~0:35 | ~3:30 |
| M3 | 24 GB | ~2:40 | ~0:32 | ~3:12 |
| M3 Pro | 24 GB | ~2:10 | ~0:24 | ~2:34 |
| M4 | 24 GB | ~1:55 | ~0:21 | ~2:16 |
| M4 Pro | 24 GB | ~1:30 | ~0:17 | ~1:47 |
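The table above, expressed as data, with a small helper that parses the "m:ss" timings and confirms the Total column is the sum of the two stages:

```python
def to_seconds(t: str) -> int:
    """Parse an 'm:ss' timing into seconds."""
    minutes, seconds = t.split(":")
    return int(minutes) * 60 + int(seconds)

TIMINGS = {  # chip: (transcribe, draft, total)
    "M1":     ("5:10", "1:25", "6:35"),
    "M2":     ("3:40", "0:50", "4:30"),
    "M2 Pro": ("2:55", "0:35", "3:30"),
    "M3":     ("2:40", "0:32", "3:12"),
    "M3 Pro": ("2:10", "0:24", "2:34"),
    "M4":     ("1:55", "0:21", "2:16"),
    "M4 Pro": ("1:30", "0:17", "1:47"),
}

# Each row's total is exactly transcribe + draft.
for chip, (transcribe, draft, total) in TIMINGS.items():
    assert to_seconds(transcribe) + to_seconds(draft) == to_seconds(total), chip
```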
The M1 at 8 GB of RAM is technically supported but marginal: Qwen 14B in 4-bit exceeds the available memory pool once the OS and the whisper decoder are resident, so we recommend 16 GB as a floor. On an 8 GB M1, a compatible 8B model (Llama 3.1 8B in 4-bit, bundled as an alternative) completes the full pipeline in under five minutes with a small quality trade-off.
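The resulting default-model rule is simple: 14B at or above the 16 GB floor, the bundled 8B below it. A hedged sketch (the function name and string labels are illustrative, not the app's internals):

```python
def default_drafting_model(unified_ram_gb: int) -> str:
    """Pick the default drafting model from the unified-RAM floor described above."""
    if unified_ram_gb >= 16:
        return "Qwen 2.5 14B Instruct (4-bit)"
    return "Llama 3.1 8B (4-bit)"  # bundled fallback for 8 GB machines
```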
Install footprint
First run downloads all artifacts from a content-addressed CDN. After that everything lives locally:
- App binary — 48 MB.
- whisper large-v3 5-bit Core-ML weights — 1.1 GB.
- Qwen 2.5 14B Instruct 4-bit MLX weights — 8.2 GB.
- Per-session data (audio + transcript + draft, typical) — 35–90 MB, configurable retention.
All of it lives in ~/Library/Application Support/TherapyDraft, encrypted at rest with keys managed by the macOS keychain. Uninstall deletes the directory and the keychain entries; no residue remains.
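For capacity planning, the per-session range above multiplies out straightforwardly. A back-of-envelope helper, purely illustrative, using the midpoint-ish 60 MB per session as a default assumption:

```python
def storage_estimate_mb(sessions_per_week: int, retained_weeks: int,
                        mb_per_session: float = 60.0) -> float:
    """Estimate local session storage: sessions/week x retained weeks x MB/session."""
    return sessions_per_week * retained_weeks * mb_per_session

# e.g. a 25-session week retained for 12 weeks at ~60 MB each:
# 25 * 12 * 60 = 18,000 MB, i.e. about 18 GB of session data.
```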
Why Apple Silicon specifically
Apple's unified-memory architecture is the thing that makes clinical-quality 14B inference practical on a consumer laptop. A 14B parameter model in 4-bit occupies about 8 GB of memory; on a discrete-GPU Windows machine, fitting that into a 12 GB VRAM card is tight and the CPU still needs its own RAM. On M-series, the same 8 GB is just "8 GB of the system memory," and the GPU reads it directly. That's why the same model runs ~3× faster on an M3 Pro than on a comparably-priced x86 laptop.
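The arithmetic behind that claim: weights alone for a 14B-parameter model at 4 bits per parameter come to ~7 GB, and KV cache plus activations push the resident footprint toward the ~8 GB figure above.

```python
def weight_bytes(params: int, bits_per_param: int) -> int:
    """Bytes occupied by model weights alone at a given quantization width."""
    return params * bits_per_param // 8

# 14B parameters at 4 bits: 7,000,000,000 bytes, i.e. ~7 GB of weights
# before KV cache and activation overhead.
fourteen_b_4bit = weight_bytes(14_000_000_000, 4)
```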
Windows support is planned for a later release (see FAQ); it will ship with GGUF + CPU inference fallbacks, with a corresponding latency caveat.
Related questions
What if I'm on battery and step away?
Inference is aggressive and will use the full GPU. Transcribing a 50-minute session on an M2 costs roughly 4–6% of battery. The app deliberately does not background-draft older sessions — that would be a battery drain you can't see.
Does this work with external GPUs?
No. MLX targets the Apple Silicon GPU directly. External GPUs on Apple Silicon Macs aren't supported by Apple for compute in the first place, so this isn't a TherapyDraft limitation.
Can I swap in a different model?
Yes. Settings → Models accepts any MLX-compatible 4-bit quantized model of comparable size. Llama 3.1 8B, Qwen 2.5 7B, and a few community fine-tunes are tested; you can point at your own if you've trained one.
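A sketch of what "comparable size" might mean as a gate on user-supplied weights; the bounds and function name here are assumptions for illustration, not the app's actual validation logic:

```python
def is_swap_candidate(quant_bits: int, params_billions: float) -> bool:
    """Accept MLX models that match the 4-bit quantization and the 7B-14B size band."""
    return quant_bits == 4 and 7.0 <= params_billions <= 14.0
```

Under these assumed bounds, the tested alternatives (Llama 3.1 8B, Qwen 2.5 7B) pass, while an unquantized or much larger model would be rejected before loading.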