Topic · Therapy-note AI without cloud

Yes, there is a therapy-note AI that doesn't use the cloud — here's how it works

The common reaction to "on-device AI therapy scribe" is "that's not possible yet." In 2025 it became possible, and as of 2026 it is the only honest answer to give a clinician who asks.

TL;DR

Every AI therapy scribe sold today — Mentalyc, Upheal, Blueprint, Supanote, Freed, CliniScripts — sends session audio to a server. It's not a privacy oversight; it's what "cloud scribe" means. TherapyDraft is the existence-proof that you no longer have to. Two pieces of commodity open-source infrastructure (whisper.cpp for transcription, Qwen 2.5 14B via Apple's MLX for drafting), plus Apple Silicon unified-memory performance, plus the macOS sandbox, together produce a clinical-quality draft in 2–5 minutes without a single packet leaving your Mac.

Why most clinicians assume this doesn't exist

Until recently, they were right. Speech-to-text good enough for emotional, overlapping, soft-spoken therapy dialogue required a large cloud model. Note drafting good enough for a clinician to edit rather than rewrite required an even larger one. Running both on a laptop was absurd — consumer GPUs had 8 GB of VRAM, and a usable model at full precision needed 50 GB or more.

Three things changed between 2023 and 2026:

  1. Apple unified memory. M-series Macs share RAM between CPU and GPU, so a 14 B-parameter model occupies system memory directly rather than requiring a dedicated GPU card. A 24 GB M3 Pro has more addressable GPU memory than a $1,500 Windows laptop.
  2. 4-bit quantization. Techniques like AWQ and GPTQ compress a 14 B-parameter model from ~28 GB down to ~8 GB with ≤2% quality loss on instruction-following benchmarks. The Qwen 2.5 family specifically publishes official 4-bit weights. The model goes from "cloud-only" to "fits in RAM."
  3. whisper.cpp + Core ML acceleration. whisper.cpp runs OpenAI's Whisper models on commodity hardware, and its Core ML path offloads the encoder to the Apple Neural Engine. The real-time factor on an M2 is well under 1×, meaning a 50-minute session transcribes in much less than 50 minutes; in practice, 3–4 minutes.
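The arithmetic behind points 2 and 3 is worth making concrete. A back-of-envelope sketch (the 2-bytes-per-weight figure is fp16; real footprints also include KV-cache and runtime overhead, which is why ~7 GB of weights lands closer to ~8 GB in RAM, and the real-time factor here is illustrative):

```python
# Back-of-envelope memory and timing arithmetic for a local scribe.
# Figures are illustrative; actual footprints vary with overhead.

PARAMS = 14e9  # Qwen 2.5 14B

fp16_gb = PARAMS * 2 / 1e9    # 2 bytes per weight at fp16 precision
int4_gb = PARAMS * 0.5 / 1e9  # 0.5 bytes per weight at 4-bit

print(f"fp16: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")

# Real-time factor = transcription_time / audio_duration.
session_min = 50
rtf = 0.07  # hypothetical; "3-4 minutes for 50 minutes" implies RTF ~0.06-0.08
print(f"~{session_min * rtf:.1f} minutes to transcribe a {session_min}-minute session")
```

This is exactly why the model crosses from "cloud-only" to "fits in the RAM of a mid-range MacBook."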

None of these are proprietary to us. They are open-source components any engineer can audit or reproduce. What we did was stitch them together with a clinically-aware prompt template, a tamper-evident log, and a macOS distribution that enforces the no-socket guarantee.
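The tamper-evident log is the one piece that isn't an off-the-shelf component. The standard construction is a hash chain: each entry commits to the hash of the previous one, so editing any historical entry invalidates everything after it. A minimal sketch of the idea (not TherapyDraft's actual format; field names are illustrative):

```python
import hashlib
import json

def append_entry(log: list[dict], event: str) -> None:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"event": event, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify(log: list[dict]) -> bool:
    """Recompute every hash; any in-place edit breaks the chain."""
    prev = "0" * 64
    for entry in log:
        body = {"event": entry["event"], "prev": entry["prev"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != digest:
            return False
        prev = entry["hash"]
    return True

log: list[dict] = []
append_entry(log, "transcription started")
append_entry(log, "draft generated")
assert verify(log)

log[0]["event"] = "tampered"  # a retroactive edit...
assert not verify(log)        # ...breaks every subsequent hash
```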

What "without cloud" actually looks like on your machine

When you drop a session audio file into TherapyDraft:

  1. whisper.cpp transcribes the audio on-device, with the Core ML path accelerating the Whisper encoder.
  2. The transcript is assembled into the clinically-aware prompt template, along with your chosen note format and few-shot example notes.
  3. Qwen 2.5 14B (4-bit, via MLX) drafts the note in unified memory.
  4. The draft and a tamper-evident log entry are written to local disk. Total time: 2–5 minutes.

You can run sudo lsof -a -i -n -p $(pgrep TherapyDraft) during any of this and see zero open sockets besides a Stripe license check at startup. We're not arguing about it; you can check.
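The flow above reduces to a two-step local pipeline. A hypothetical sketch with both inference steps stubbed out (in the real app, the first shells out to whisper.cpp and the second prompts the MLX-hosted Qwen model; these function names are illustrative, not TherapyDraft's API):

```python
from pathlib import Path

# Hypothetical sketch of the local pipeline. transcribe() and draft_note()
# are stand-ins for whisper.cpp and the MLX-hosted Qwen 2.5 14B model.

def transcribe(audio: Path) -> str:
    # Stand-in for on-device speech-to-text; the audio never leaves the machine.
    return f"[transcript of {audio.name}]"

def draft_note(transcript: str, note_format: str = "SOAP") -> str:
    # Stand-in for on-device drafting with the clinically-aware template.
    return f"{note_format} draft based on: {transcript}"

def process_session(audio: Path) -> str:
    transcript = transcribe(audio)  # step 1: local transcription
    return draft_note(transcript)   # step 2: local drafting

print(process_session(Path("session.wav")))
```

Note what the sketch has no place for: an upload step. There is no socket because the pipeline never needs one.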

Is the output really clinical-quality without cloud-scale parameters?

For SOAP/DAP/BIRP/GIRP drafting from a transcript, a 14 B-parameter model with a well-crafted prompt and five of the clinician's own example notes as few-shot anchors produces output that is, in our internal benchmarking, a ~1-minute edit away from sendable. A 70 B cloud model will sometimes produce a draft that needs only 30 seconds of edit instead of 60 — but neither number is "zero," and for many clinicians the 30 seconds of time saved does not justify shipping the audio to a vendor.
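Concretely, the few-shot anchoring works by prepending the clinician's own past notes to the drafting prompt, so the model imitates their voice rather than a generic one. A simplified sketch of the template assembly (the section markers and wording here are illustrative, not TherapyDraft's actual template):

```python
def build_prompt(transcript: str, examples: list[str], note_format: str = "SOAP") -> str:
    """Assemble a drafting prompt with the clinician's own notes as few-shot anchors."""
    parts = [
        f"You are drafting a {note_format} therapy progress note.",
        "Match the style and level of detail of these example notes:",
    ]
    for i, note in enumerate(examples, 1):
        parts.append(f"--- Example {i} ---\n{note}")
    parts.append(f"--- Session transcript ---\n{transcript}")
    parts.append(f"Draft the {note_format} note:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Client reported improved sleep...",
    ["S: ... O: ... A: ... P: ...", "S: ... O: ... A: ... P: ..."],
)
```

The examples are what close most of the quality gap to a larger model: the 14 B model isn't asked to invent a documentation style, only to apply yours to a new transcript.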

The honest caveat: AI scribe output, whatever the architecture, is a first draft. Reading and correcting it is a clinical-responsibility step, not a workflow inconvenience. See the pricing-comparison page for the side-by-side on cost-per-session and quality trade-offs.

What you still need to bring

A recent M-series Mac (M1 or later) with 16 GB of RAM as the recommended floor; 24 GB if you want sub-3-minute turnarounds. macOS 14 Sonoma or later. About 10 GB of free disk for the bundled models. No special audio hardware — the Mac's built-in microphone handles in-person sessions fine; a USB lavalier helps for the first few weeks while you calibrate volume.

Related questions

Is "local AI therapy notes" the same as "on-device therapy note generator"?

Yes: they're two phrasings of the same architectural choice. "Local" is the more clinician-facing term; "on-device" is the more engineering-facing one.

Why does no cloud scribe also offer a local mode?

Because their business model and their stack are built around cloud inference. Adding a local mode would require rewriting the transcription and drafting layer, re-QA-ing every note-format output, and shipping a signed native binary on every OS. For most of them, the cost of that rewrite exceeds the expected revenue from the privacy-first segment.

Open source alternative?

The components (whisper.cpp, MLX, Qwen, Llama) are open source; a packaged, signed, clinically-framed product with a tamper-evident attestation log and four note-format outputs is not. If you are technically inclined, you can assemble a functional equivalent yourself in a weekend — many clinicians have. TherapyDraft is the "paid to not spend the weekend" option with regulatory-grade documentation.

Further reading