Topic · Therapy-note AI without cloud

Yes, there is a therapy-note AI that doesn't use the cloud — here's how it works

The common reaction to "on-device AI therapy scribe" is "that's not possible yet." In 2025 it became possible, and as of 2026 it is the only honest answer to give a clinician who asks.

TL;DR

Every AI therapy scribe sold today — Mentalyc, Upheal, Blueprint, Supanote, Freed, CliniScripts — sends session audio to a server. It's not a privacy oversight; it's what "cloud scribe" means. TherapyDraft is the existence-proof that you no longer have to. Two pieces of commodity open-source infrastructure (whisper.cpp for transcription, Qwen 2.5 14B via Apple's MLX for drafting), plus Apple Silicon unified-memory performance, plus the macOS sandbox, together produce a clinical-quality draft in 2–5 minutes without a single packet leaving your Mac.

Why most clinicians assume this doesn't exist

Until recently, they were right. Speech-to-text good enough for emotional, overlapping, soft-spoken therapy dialogue required a large cloud model. Note drafting good enough for a clinician to edit rather than rewrite required an even larger one. Running both on a laptop was absurd — consumer GPUs had 8 GB of VRAM, and a usable model at full precision needed 50 GB or more.

Three things changed between 2023 and 2026:

  1. Apple unified memory. M-series Macs share RAM between CPU and GPU, so a 14 B-parameter model occupies system memory directly rather than requiring a dedicated GPU card. A 24 GB M3 Pro has more addressable GPU memory than a $1,500 Windows laptop.
  2. 4-bit quantization. Techniques like AWQ and GPTQ compress a 14 B-parameter model from ~28 GB down to ~8 GB with ≤2% quality loss on instruction-following benchmarks. The Qwen 2.5 family specifically publishes official 4-bit weights. The model goes from "cloud-only" to "fits in RAM."
  3. whisper.cpp + Core ML acceleration. whisper.cpp runs OpenAI's Whisper models on commodity hardware, and its Core ML path offloads the encoder to the Apple Neural Engine. The real-time factor on an M2 is well under 1×, meaning a 50-minute session transcribes in much less than 50 minutes; in practice, 3–4 minutes.
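The arithmetic behind points 2 and 3 is worth making concrete. A back-of-envelope sketch (the 2-bytes-per-weight figure is fp16; real footprints also include KV-cache and runtime overhead, which is why ~7 GB of weights lands closer to ~8 GB in RAM, and the real-time factor here is illustrative):

```python
# Back-of-envelope memory and timing arithmetic for a local scribe.
# Figures are illustrative; actual footprints vary with overhead.

PARAMS = 14e9  # Qwen 2.5 14B

fp16_gb = PARAMS * 2 / 1e9    # 2 bytes per weight at fp16 precision
int4_gb = PARAMS * 0.5 / 1e9  # 0.5 bytes per weight at 4-bit

print(f"fp16: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")

# Real-time factor = transcription_time / audio_duration.
session_min = 50
rtf = 0.07  # hypothetical; "3-4 minutes for 50 minutes" implies RTF ~0.06-0.08
print(f"~{session_min * rtf:.1f} minutes to transcribe a {session_min}-minute session")
```

This is exactly why the model crosses from "cloud-only" to "fits in the RAM of a mid-range MacBook."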

None of these are proprietary to us. They are open-source components any engineer can audit or reproduce. What we did was stitch them together with a clinically-aware prompt template, a tamper-evident log, and a macOS distribution that enforces the no-socket guarantee.
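The tamper-evident log is the one piece that isn't an off-the-shelf component. The standard construction is a hash chain: each entry commits to the hash of the previous one, so editing any historical entry invalidates everything after it. A minimal sketch of the idea (not TherapyDraft's actual format; field names are illustrative):

```python
import hashlib
import json

def append_entry(log: list[dict], event: str) -> None:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"event": event, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify(log: list[dict]) -> bool:
    """Recompute every hash; any in-place edit breaks the chain."""
    prev = "0" * 64
    for entry in log:
        body = {"event": entry["event"], "prev": entry["prev"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != digest:
            return False
        prev = entry["hash"]
    return True

log: list[dict] = []
append_entry(log, "transcription started")
append_entry(log, "draft generated")
assert verify(log)

log[0]["event"] = "tampered"  # a retroactive edit...
assert not verify(log)        # ...breaks every subsequent hash
```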

What "without cloud" actually looks like on your machine

When you drop a session audio file into TherapyDraft:

  1. whisper.cpp transcribes the audio on-device, with the Core ML path accelerating the Whisper encoder.
  2. The transcript is assembled into the clinically-aware prompt template, along with your chosen note format and few-shot example notes.
  3. Qwen 2.5 14B (4-bit, via MLX) drafts the note in unified memory.
  4. The draft and a tamper-evident log entry are written to local disk. Total time: 2–5 minutes.

You can run sudo lsof -a -i -n -p $(pgrep TherapyDraft) during any of this and see zero open sockets besides a Stripe license check at startup. We're not arguing about it; you can check.
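The flow above reduces to a two-step local pipeline. A hypothetical sketch with both inference steps stubbed out (in the real app, the first shells out to whisper.cpp and the second prompts the MLX-hosted Qwen model; these function names are illustrative, not TherapyDraft's API):

```python
from pathlib import Path

# Hypothetical sketch of the local pipeline. transcribe() and draft_note()
# are stand-ins for whisper.cpp and the MLX-hosted Qwen 2.5 14B model.

def transcribe(audio: Path) -> str:
    # Stand-in for on-device speech-to-text; the audio never leaves the machine.
    return f"[transcript of {audio.name}]"

def draft_note(transcript: str, note_format: str = "SOAP") -> str:
    # Stand-in for on-device drafting with the clinically-aware template.
    return f"{note_format} draft based on: {transcript}"

def process_session(audio: Path) -> str:
    transcript = transcribe(audio)  # step 1: local transcription
    return draft_note(transcript)   # step 2: local drafting

print(process_session(Path("session.wav")))
```

Note what the sketch has no place for: an upload step. There is no socket because the pipeline never needs one.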

Is the output really clinical-quality without cloud-scale parameters?

For SOAP/DAP/BIRP/GIRP drafting from a transcript, a 14 B-parameter model with a well-crafted prompt and five of the clinician's own example notes as few-shot anchors produces output that is, in our internal benchmarking, a ~1-minute edit away from sendable. A 70 B cloud model will sometimes produce a draft that needs only 30 seconds of edit instead of 60 — but neither number is "zero," and for many clinicians the 30 seconds of time saved does not justify shipping the audio to a vendor.
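Concretely, the few-shot anchoring works by prepending the clinician's own past notes to the drafting prompt, so the model imitates their voice rather than a generic one. A simplified sketch of the template assembly (the section markers and wording here are illustrative, not TherapyDraft's actual template):

```python
def build_prompt(transcript: str, examples: list[str], note_format: str = "SOAP") -> str:
    """Assemble a drafting prompt with the clinician's own notes as few-shot anchors."""
    parts = [
        f"You are drafting a {note_format} therapy progress note.",
        "Match the style and level of detail of these example notes:",
    ]
    for i, note in enumerate(examples, 1):
        parts.append(f"--- Example {i} ---\n{note}")
    parts.append(f"--- Session transcript ---\n{transcript}")
    parts.append(f"Draft the {note_format} note:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Client reported improved sleep...",
    ["S: ... O: ... A: ... P: ...", "S: ... O: ... A: ... P: ..."],
)
```

The examples are what close most of the quality gap to a larger model: the 14 B model isn't asked to invent a documentation style, only to apply yours to a new transcript.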

The honest caveat: AI scribe output, whatever the architecture, is a first draft. Reading and correcting it is a clinical-responsibility step, not a workflow inconvenience. See the pricing-comparison page for the side-by-side on cost-per-session and quality trade-offs.

What you still need to bring

A recent M-series Mac (M1 or later) with 16 GB of RAM as the recommended floor; 24 GB if you want sub-3-minute turnarounds. macOS 14 Sonoma or later. About 10 GB of free disk for the bundled models. No special audio hardware — the Mac's built-in microphone handles in-person sessions fine; a USB lavalier helps for the first few weeks while you calibrate volume.

Related questions

Is "local AI therapy notes" the same as "on-device therapy note generator"?

Yes: they're two phrasings of the same architectural choice. "Local" is the more clinician-facing term; "on-device" is the more engineering-facing one.

Why does no cloud scribe also offer a local mode?

Because their business model and their stack are built around cloud inference. Adding a local mode would require rewriting the transcription and drafting layer, re-QA-ing every note-format output, and shipping a signed native binary on every OS. For most of them, the cost of that rewrite exceeds the expected revenue from the privacy-first segment.

Open source alternative?

The components (whisper.cpp, MLX, Qwen, Llama) are open source; a packaged, signed, clinically-framed product with a tamper-evident attestation log and four note-format outputs is not. If you are technically inclined, you can assemble a functional equivalent yourself in a weekend — many clinicians have. TherapyDraft is the "paid to not spend the weekend" option with regulatory-grade documentation.

Further reading