On-device therapy note generator for M-series Macs
The question every technical buyer asks first is "how long does one note take on my Mac." Here's the honest table, chip by chip, measured on a real 50-minute session transcript.
TL;DR
TherapyDraft drafts a therapy note entirely on the clinician's Mac — audio in, SOAP/DAP/BIRP/GIRP out, and nothing in the audio-to-draft path ever touches the network. On an M2 with 16 GB of RAM, a 50-minute session transcribes in ~3:40 and drafts the note in ~50 seconds, for about 4.5 minutes wall-clock end to end. An M4 Pro finishes the same workload in under 2 minutes. Install footprint is roughly 9.5 GB on first run (app binary + whisper large-v3 weights + Qwen 2.5 14B in 4-bit). That's the whole picture.
What "on-device" means here
When a tool claims to be on-device, the important question is "which device." TherapyDraft runs the two heaviest parts of the pipeline — speech-to-text and note drafting — on the Apple Silicon GPU of the clinician's own Mac. Nothing in the audio-to-draft path touches a server we control, a server a third party controls, or even the public internet. The only outbound traffic from the app is a Stripe license check at startup and an anonymous version ping, both governed by the macOS network entitlement allow-list documented on the private AI therapy scribe page.
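That entitlement model amounts to an allow-list: every outbound host except the two named endpoints is blocked. A minimal sketch of such a check, with hostnames that are purely illustrative assumptions (not TherapyDraft's actual allow-list):

```python
from urllib.parse import urlparse

# Illustrative allow-list; the real hostnames are assumptions, not documented.
ALLOWED_HOSTS = {
    "api.stripe.com",       # license check at startup (assumed host)
    "ping.example.com",     # anonymous version ping (hypothetical host)
}

def is_outbound_allowed(url: str) -> bool:
    """Permit a request only if its host is on the allow-list."""
    return urlparse(url).hostname in ALLOWED_HOSTS
```

Anything in the transcription or drafting path would fail this check by construction, since neither stage constructs a URL at all.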
Runtime architecture
The pipeline is two stages, both running natively on Apple Silicon:
- Transcription. whisper.cpp running the large-v3 model in a 5-bit Core-ML-accelerated build. On Apple Silicon this is the fastest well-audited speech-to-text available that still ships clinical-quality transcription of soft-spoken, emotionally charged speech.
- Drafting. Qwen 2.5 14B Instruct, quantized to 4-bit, running on the MLX runtime. The draft prompt is a short template plus five of your own example notes (the clinician's voice, not a generic medical tone) and the transcribed session. Output is structured into SOAP, DAP, BIRP, or GIRP as selected.
The model is a separately-downloadable artifact, so if the Qwen licensing ever shifts or a better 14B ships, you swap the weights in the settings panel without reinstalling the app.
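The drafting stage's prompt assembly can be sketched roughly as follows. The function name and template wording are hypothetical; only the inputs (a note format, five example notes, the transcript) come from the description above:

```python
NOTE_FORMATS = {"SOAP", "DAP", "BIRP", "GIRP"}

def build_draft_prompt(note_format: str, example_notes: list[str], transcript: str) -> str:
    """Assemble the drafting prompt: template + five example notes + transcript."""
    if note_format not in NOTE_FORMATS:
        raise ValueError(f"unsupported note format: {note_format}")
    if len(example_notes) != 5:
        raise ValueError("expected exactly five example notes")
    examples = "\n\n".join(
        f"Example note {i + 1}:\n{note}" for i, note in enumerate(example_notes)
    )
    return (
        f"Write a {note_format} note in the style of the example notes below.\n\n"
        f"{examples}\n\n"
        f"Session transcript:\n{transcript}\n\n"
        f"{note_format} note:"
    )
```

The five examples are what anchor the output to the clinician's own voice rather than a generic medical register.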
Latency by chip (50-minute session, 600-token draft)
Measured on clean installs with the default large-v3 transcription and the default 14B drafting model. Timings are wall-clock, from "drag the audio file in" to "draft ready to copy."
| Chip | Unified RAM | Transcribe (50 min audio) | Draft (600 tokens) | Total |
|---|---|---|---|---|
| M1 | 16 GB | ~5:10 | ~1:25 | ~6:35 |
| M2 | 16 GB | ~3:40 | ~0:50 | ~4:30 |
| M2 Pro | 16 GB | ~2:55 | ~0:35 | ~3:30 |
| M3 | 24 GB | ~2:40 | ~0:32 | ~3:12 |
| M3 Pro | 24 GB | ~2:10 | ~0:24 | ~2:34 |
| M4 | 24 GB | ~1:55 | ~0:21 | ~2:16 |
| M4 Pro | 24 GB | ~1:30 | ~0:17 | ~1:47 |
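The table above, expressed as data, with a small helper that parses the "m:ss" timings and confirms the Total column is the sum of the two stages:

```python
def to_seconds(t: str) -> int:
    """Parse an 'm:ss' timing into seconds."""
    minutes, seconds = t.split(":")
    return int(minutes) * 60 + int(seconds)

TIMINGS = {  # chip: (transcribe, draft, total)
    "M1":     ("5:10", "1:25", "6:35"),
    "M2":     ("3:40", "0:50", "4:30"),
    "M2 Pro": ("2:55", "0:35", "3:30"),
    "M3":     ("2:40", "0:32", "3:12"),
    "M3 Pro": ("2:10", "0:24", "2:34"),
    "M4":     ("1:55", "0:21", "2:16"),
    "M4 Pro": ("1:30", "0:17", "1:47"),
}

# Each row's total is exactly transcribe + draft.
for chip, (transcribe, draft, total) in TIMINGS.items():
    assert to_seconds(transcribe) + to_seconds(draft) == to_seconds(total), chip
```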
The M1 at 8 GB of RAM is technically supported but marginal: Qwen 14B in 4-bit exceeds the available memory pool once the OS and the whisper decoder are resident, so we recommend 16 GB as a floor. On an 8 GB M1, a compatible 8B model (Llama 3.1 8B in 4-bit, bundled as an alternative) completes the full pipeline in under five minutes with a small quality trade-off.
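The resulting default-model rule is simple: 14B at or above the 16 GB floor, the bundled 8B below it. A hedged sketch (the function name and string labels are illustrative, not the app's internals):

```python
def default_drafting_model(unified_ram_gb: int) -> str:
    """Pick the default drafting model from the unified-RAM floor described above."""
    if unified_ram_gb >= 16:
        return "Qwen 2.5 14B Instruct (4-bit)"
    return "Llama 3.1 8B (4-bit)"  # bundled fallback for 8 GB machines
```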
Install footprint
First run downloads all artifacts from a content-addressed CDN. After that everything lives locally:
- App binary — 48 MB.
- whisper large-v3 5-bit Core-ML weights — 1.1 GB.
- Qwen 2.5 14B Instruct 4-bit MLX weights — 8.2 GB.
- Per-session data (audio + transcript + draft, typical) — 35–90 MB, configurable retention.
All of it lives in ~/Library/Application Support/TherapyDraft, encrypted at rest with keys managed by the macOS keychain. Uninstall deletes the directory and the keychain entries; no residue remains.
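For capacity planning, the per-session range above multiplies out straightforwardly. A back-of-envelope helper, purely illustrative, using the midpoint-ish 60 MB per session as a default assumption:

```python
def storage_estimate_mb(sessions_per_week: int, retained_weeks: int,
                        mb_per_session: float = 60.0) -> float:
    """Estimate local session storage: sessions/week x retained weeks x MB/session."""
    return sessions_per_week * retained_weeks * mb_per_session

# e.g. a 25-session week retained for 12 weeks at ~60 MB each:
# 25 * 12 * 60 = 18,000 MB, i.e. about 18 GB of session data.
```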
Why Apple Silicon specifically
Apple's unified-memory architecture is the thing that makes clinical-quality 14B inference practical on a consumer laptop. A 14B parameter model in 4-bit occupies about 8 GB of memory; on a discrete-GPU Windows machine, fitting that into a 12 GB VRAM card is tight and the CPU still needs its own RAM. On M-series, the same 8 GB is just "8 GB of the system memory," and the GPU reads it directly. That's why the same model runs ~3× faster on an M3 Pro than on a comparably-priced x86 laptop.
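The arithmetic behind that claim: weights alone for a 14B-parameter model at 4 bits per parameter come to ~7 GB, and KV cache plus activations push the resident footprint toward the ~8 GB figure above.

```python
def weight_bytes(params: int, bits_per_param: int) -> int:
    """Bytes occupied by model weights alone at a given quantization width."""
    return params * bits_per_param // 8

# 14B parameters at 4 bits: 7,000,000,000 bytes, i.e. ~7 GB of weights
# before KV cache and activation overhead.
fourteen_b_4bit = weight_bytes(14_000_000_000, 4)
```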
Windows support is planned for a later release (see FAQ); it will ship with GGUF + CPU inference fallbacks, with a corresponding latency caveat.
Related questions
What if I'm on battery and step away?
Inference is aggressive and will use the full GPU. Transcribing a 50-minute session on an M2 costs roughly 4–6% of battery. The app deliberately does not background-draft older sessions — that would be a battery drain you can't see.
Does this work with external GPUs?
No. MLX targets the Apple Silicon GPU directly. External GPUs on Apple Silicon Macs aren't supported by Apple for compute in the first place, so this isn't a TherapyDraft limitation.
Can I swap in a different model?
Yes. Settings → Models accepts any MLX-compatible 4-bit quantized model of comparable size. Llama 3.1 8B, Qwen 2.5 7B, and a few community fine-tunes are tested; you can point at your own if you've trained one.
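A sketch of what "comparable size" might mean as a gate on user-supplied weights; the bounds and function name here are assumptions for illustration, not the app's actual validation logic:

```python
def is_swap_candidate(quant_bits: int, params_billions: float) -> bool:
    """Accept MLX models that match the 4-bit quantization and the 7B-14B size band."""
    return quant_bits == 4 and 7.0 <= params_billions <= 14.0
```

Under these assumed bounds, the tested alternatives (Llama 3.1 8B, Qwen 2.5 7B) pass, while an unquantized or much larger model would be rejected before loading.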