Topic · On-device inference
Local AI therapy notes: whisper.cpp + a quantized 14B on your M-series Mac
Until recently, "run the whole AI-scribe stack on a laptop" was aspirational. In 2026 it's a signed-and-notarized desktop app. Here's what it actually takes, and what the runtime numbers look like on each M-series chip.
TL;DR
Local AI therapy notes means: the audio recording, the transcript, and the drafted note all live exclusively on your Mac. TherapyDraft pairs whisper.cpp (Core ML-accelerated transcription) with a 4-bit quantized 14B-parameter language model running on Apple's MLX framework. A 50-minute session transcribes in 2–4 minutes on an M2, and the first-pass SOAP draft lands in another 45–75 seconds. Nothing crosses a network socket.
The two components, and why they matter
A therapy-note scribe is two models glued together: one that turns audio into words, and one that turns words into a structured clinical note. Cloud scribes run both in someone else's data center. A local scribe needs each piece to be small enough, fast enough, and accurate enough to fit inside a therapist's Mac without taking it over.
Audio → transcript: whisper.cpp
TherapyDraft uses OpenAI's Whisper large-v3 model via whisper.cpp, quantized to 5-bit and Core ML-accelerated. The model is open-weights and handles the slow, pause-heavy cadence of therapy sessions well — a cadence that trips up VAD-based cloud pipelines. On an M2 with 16 GB of unified memory, transcription runs at roughly 15–25× real-time for the large-v3 quant, so a 50-minute session lands in about three minutes of wall time.
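The "about three minutes" figure falls straight out of the real-time factor (RTF): wall time is session length divided by RTF. A minimal sketch, using the 15–25× M2 range quoted above:

```python
def transcribe_wall_time(session_minutes: float, realtime_factor: float) -> float:
    """Wall-clock minutes to transcribe a session at a given real-time factor."""
    return session_minutes / realtime_factor

# A 50-minute session at the low and high ends of the quoted 15-25x range:
slow = transcribe_wall_time(50, 15)
fast = transcribe_wall_time(50, 25)
print(f"{fast:.1f}-{slow:.1f} minutes")  # prints "2.0-3.3 minutes"
```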
Transcript → SOAP draft: Qwen 2.5 14B, MLX 4-bit
Drafting is handled by a 14-billion-parameter model (Qwen 2.5 14B Instruct by default) quantized to 4 bits and served via Apple's MLX framework. The quantized weights occupy about 8 GB on disk. Generation runs at 28–42 tokens per second on an M2, so a ~600-token SOAP note takes under a minute. The model is a swappable artifact — you can pick a 7B alternative for M1 Air comfort, or bring your own fine-tune if you prefer a different clinical voice.
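Two of the numbers above can be sanity-checked with arithmetic. The ~8 GB on disk is roughly 14B parameters at 4 bits plus quantization-scale overhead, and the 45–75 s draft time is mostly prefill (reading the long session transcript into the model), not the 600 decoded tokens. The 10% overhead, 8,000-token transcript, and 200 tok/s prefill rate below are illustrative assumptions, not measured TherapyDraft figures:

```python
def quantized_weight_gb(params_billion: float, bits: int, overhead: float = 0.10) -> float:
    """Approximate weight size: bits per param, plus ~10% for quantization scales."""
    return params_billion * bits / 8 * (1 + overhead)

def draft_seconds(prompt_tokens: int, gen_tokens: int,
                  prefill_tok_s: float, decode_tok_s: float) -> float:
    """Total generation time = prompt prefill + token-by-token decode."""
    return prompt_tokens / prefill_tok_s + gen_tokens / decode_tok_s

print(f"{quantized_weight_gb(14, 4):.1f} GB")        # 7.7 GB, near the ~8.4 GB on disk
print(f"{draft_seconds(8000, 600, 200, 30):.0f} s")  # 60 s, near the quoted ~70 s on M2
```

Decoding 600 tokens at 28–42 tok/s alone would take only ~14–21 s; the prefill term is what brings the total in line with the measured draft times.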
Install footprint
| Component | Disk | RAM (peak) |
|---|---|---|
| App binary (signed/notarized .app) | ~120 MB | — |
| Whisper large-v3 (5-bit) | ~1.1 GB | ~1.8 GB |
| Qwen 2.5 14B (4-bit MLX) | ~8.4 GB | ~10 GB |
| Session data (per session) | <60 MB | — |
First-run download is ~9.5 GB from TherapyDraft's CDN (model artifacts are content-addressed and signed). After that, the app runs offline indefinitely. We also honor the macOS "Low Power Mode" setting — if you're on battery and toggle it on, we fall back to a 7B model for drafting so your laptop doesn't get warm.
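"Content-addressed and signed" means the app can check a downloaded artifact against its own name before loading it. A minimal sketch of the content-addressing half, assuming SHA-256 as the address function (the function names are illustrative, and signature verification is omitted):

```python
import hashlib
import hmac

def content_address(blob: bytes) -> str:
    """An artifact's name is derived from the SHA-256 of its own bytes."""
    return hashlib.sha256(blob).hexdigest()

def verify_artifact(blob: bytes, expected_digest: str) -> bool:
    """Refuse to load a model file whose bytes don't hash to its address."""
    return hmac.compare_digest(content_address(blob), expected_digest)

weights = b"\x00" * 1024          # stand-in for a model file
address = content_address(weights)
assert verify_artifact(weights, address)          # intact download loads
assert not verify_artifact(weights + b"x", address)  # tampered bytes are rejected
```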
Latency by chip, real-world numbers
| Chip | RAM | Transcribe, 50-min session (m:ss) | Draft, ~600-token SOAP (s) |
|---|---|---|---|
| M1 | 16 GB | ~4:30 | ~110 s |
| M2 | 16 GB | ~3:00 | ~70 s |
| M2 Pro / M3 | 16 GB | ~2:15 | ~55 s |
| M3 Pro / M4 | 24 GB | ~1:45 | ~42 s |
These numbers come from internal benchmarks on six redacted test sessions (release candidates, not promises). Total wall time from "stop recording" to "note in clipboard" is under five minutes on every supported chip — faster than most clinicians can write the note by hand, and comparable to cloud scribes once you factor in upload time for a full session's audio.
Where the data actually lives
The app's working directory is ~/Library/Application Support/TherapyDraft. Session audio is stored there encrypted at rest with a key managed by the macOS keychain; transcripts and drafts live alongside it. A "purge session" action wipes all three files and the log entries that reference them. Nothing is synced to iCloud, nothing is emailed, nothing is uploaded — the network entitlement does not permit a socket to any host other than Stripe and our anonymous update server. See the privacy page for the exact entitlement declaration.
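The purge semantics described above (session files plus their log entries removed together) can be sketched in a few lines. The file names and log format are illustrative, not TherapyDraft's actual layout:

```python
from pathlib import Path

def purge_session(session_dir: Path, session_id: str, log_path: Path) -> None:
    """Delete a session's audio, transcript, and draft, then scrub the log."""
    for name in ("audio.enc", "transcript.txt", "draft.md"):
        (session_dir / name).unlink(missing_ok=True)
    if session_dir.exists():
        session_dir.rmdir()
    # Rewrite the log without any line that references the purged session.
    if log_path.exists():
        kept = [line for line in log_path.read_text().splitlines()
                if session_id not in line]
        log_path.write_text("".join(f"{line}\n" for line in kept))
```

Deleting the files and scrubbing the log in one action matters: an orphaned log line like "transcribed sess-42" is itself metadata about a session that no longer exists.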
What "local" means for templates and voice
Local doesn't have to mean worse. TherapyDraft's drafting model accepts a style pack: paste five of your past notes once, and the model matches your phrasing and section order on every draft going forward. Because this happens on-device, your "voice" never becomes a training signal for anyone else's model — it's a rendered prompt that lives in your Application Support folder. SOAP, DAP, BIRP, and GIRP templates ship by default; others are a text file away. See the how-it-works section for a walkthrough.
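One plausible shape for that rendered prompt: past notes become few-shot examples placed ahead of the transcript, all as plain local text. The prompt layout below is an illustrative sketch, not TherapyDraft's actual format:

```python
def render_style_prompt(past_notes: list[str], template: str, transcript: str) -> str:
    """Render a drafting prompt that carries the clinician's voice as examples."""
    examples = "\n\n".join(
        f"Example note {i + 1}:\n{note}" for i, note in enumerate(past_notes)
    )
    return (
        f"Write a {template} note in the clinician's own voice.\n\n"
        f"{examples}\n\n"
        f"Session transcript:\n{transcript}\n\n"
        f"{template} note:"
    )
```

Because the "style pack" is just this rendered string, switching from SOAP to DAP (or swapping in different example notes) changes only local text, never any model weights.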
How to try it
The private beta opens this quarter for solo practitioners on M-series Macs. Join the waitlist on the homepage — you'll get an invite link and first-run instructions by email. Pricing starts at $39/mo solo; the full breakdown is on the pricing page.
Related questions
Does it work offline?
Yes, once installed. The only network calls the app makes are a monthly license check to Stripe and an anonymous app-version check. Neither touches session data, and both can be temporarily skipped if your office Wi-Fi drops.
Can I use an external GPU or a Linux box?
Not yet. The entire stack is tuned to Apple Silicon's unified memory architecture — we don't want to ship a mediocre CUDA port until we've nailed the M-series experience. A Windows build using a GGUF + CPU fallback is on the roadmap for Q4 2026.
Will the model update itself?
Model upgrades are opt-in. When a new drafting model ships, you see a prompt, read what changed, and choose to download or stay on the current version. Transcription and drafting are never silently hot-swapped mid-practice.