Methodology

How we score AI therapy scribes — the rubric, the evidence rules, and the independence policy behind every rating on TherapyScribes. Last revised June 15, 2026.

1. Scoring rubric

Six weighted dimensions, with weights totaling 100%. A tool's editorial score is the weighted sum of its per-dimension scores, mapped to a 0–10 scale. We publish each dimension's contribution on every scribe page.

Scoring rubric weights for AI therapy scribes
Dimension	Weight	What we measure
Clinical note quality	35%	Hands-on testing on a set of representative therapy sessions across SOAP, DAP, BIRP and GIRP formats. We assess factual accuracy, speaker attribution on multi-party sessions, risk-language calibration, and rate of hallucinated quotes or fabricated history.
Compliance posture	20%	HIPAA + BAA, SOC 2 Type II, GDPR, and 42 CFR Part 2 awareness for SUD-program use. Audio retention policy, no-training-on-customer-data position, and subprocessor disclosure all factor in.
EHR / workflow integration	15%	Depth of integration into the EHRs therapists actually use — SimplePractice, TherapyNotes, Jane, Alma, Headway, Valant. Native integration > browser extension > copy-paste.
Pricing transparency	10%	Published pricing wins over sales-led-only. Free tier and meaningful trial periods score higher. We penalize fragmented multi-channel pricing or hidden enterprise minimums.
Multi-language / format breadth	10%	Languages of session capture and output; template breadth across therapy modalities (CBT, DBT, EMDR, couples, family, group).
Support & roadmap	10%	Documentation quality, response time, customer-facing roadmap, and operating-history signal.
Total	100%
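As a rough illustration of the arithmetic, here is a minimal Python sketch of the weighted sum. It assumes that each dimension is first rated on its own 0–10 scale, so the weighted sum lands on 0–10 directly. That convention is our illustration, not a published detail of the rubric. The dimension keys and the one-decimal rounding are likewise hypothetical, chosen to match how the score bands are quoted.

```python
from typing import Dict, Tuple

# Rubric weights in percent, copied from the table above; they must total 100.
WEIGHTS: Dict[str, int] = {
    "clinical_note_quality": 35,
    "compliance_posture": 20,
    "ehr_workflow_integration": 15,
    "pricing_transparency": 10,
    "multi_language_format_breadth": 10,
    "support_roadmap": 10,
}


def editorial_score(ratings: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Return (editorial score on 0-10, per-dimension contribution).

    Assumption (hypothetical): every dimension is rated 0-10 before weighting.
    """
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("rubric weights must total 100")
    if set(ratings) != set(WEIGHTS):
        raise ValueError("expected exactly one rating per rubric dimension")
    contributions: Dict[str, float] = {}
    for dimension, weight in WEIGHTS.items():
        rating = ratings[dimension]
        if not 0 <= rating <= 10:
            raise ValueError(f"{dimension} rating out of range: {rating}")
        # A 35% weight turns a 9/10 rating into a 3.15-point contribution.
        contributions[dimension] = weight * rating / 100
    # Hypothetical: round to one decimal, the precision used in the score bands.
    return round(sum(contributions.values()), 1), contributions
```

For example, dimension ratings of 9, 8, 7, 6, 8 and 7 (in table order) contribute 3.15, 1.6, 1.05, 0.6, 0.8 and 0.7 points, for an editorial score of 7.9. The per-dimension contributions are what a scribe page would list.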

Clinical note quality — 35%

Hallucination rate (fabricated quotes, dates, or history) per 100 notes
Speaker attribution accuracy on couples and family sessions
Risk-language calibration on suicidality and abuse disclosures
Adherence to the chosen note format (SOAP / DAP / BIRP / GIRP)
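Expressing hallucinations per 100 notes keeps rates comparable when tools end up with different note counts. Below is a minimal sketch, assuming the rate is a plain count of flagged events divided by the number of notes generated. The function name is hypothetical. If every one of the five test sessions is written up in all four formats, a tool produces 20 notes, so 3 fabrications would be reported as 15 per 100 notes.

```python
def rate_per_100_notes(event_count: int, note_count: int) -> float:
    """Flagged events (e.g. fabricated quotes, dates, or history) per 100 notes."""
    if note_count <= 0:
        raise ValueError("note_count must be positive")
    if event_count < 0:
        raise ValueError("event_count cannot be negative")
    return 100 * event_count / note_count


# Hypothetical run: 5 sessions x 4 note formats = 20 notes, 3 fabrications flagged.
notes = 5 * 4
print(rate_per_100_notes(3, notes))  # 15.0
```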

Compliance posture — 20%

Signed BAA available on the lowest paid tier
Independent SOC 2 Type II report (not just Type I)
Default audio retention of 0 seconds or explicit user control
Published subprocessor list with notification on change

EHR / workflow integration — 15%

Native two-way sync (note + appointment) vs one-way push
Coverage of the top six therapy EHRs
Time to first note, in minutes, starting from a cold session

Pricing transparency — 10%

Per-seat price published on the public site
Free tier or 14+ day trial without a credit card
No usage caps that are not stated on the pricing page

Multi-language / format breadth — 10%

Supported capture languages and output languages (counted separately)
Built-in templates for CBT, DBT, EMDR, couples, family, and group
User-editable template library with versioning

Support & roadmap — 10%

Public changelog updated within the last 60 days
Median support response under 24 business hours
Operating history (years shipping the product)

2. Score bands

How the 0–10 editorial score maps to a recommendation.

Editorial score bands
Score	Label	What it means
9.0 – 10.0	Best in class	Tested, leading on at least three rubric dimensions, no material compliance gap.
8.0 – 8.9	Strong pick	Tested or extensively documented, no compliance gap, weak on at most one dimension.
7.0 – 7.9	Solid option	Meets the bar on clinical quality and compliance, lags on integrations or pricing transparency.
6.0 – 6.9	Conditional	Use only if a specific feature fits your workflow; one rubric dimension is materially weak.
Below 6.0	Not recommended	Material clinical or compliance gap. We explain the specific failure in the verdict.
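The numeric part of this table is a simple threshold lookup. The Python sketch below uses hypothetical names and covers only the score thresholds. The qualitative conditions in each row, such as "no material compliance gap" or "leading on at least three rubric dimensions", are editorial judgments that it does not model. It also assumes scores are already rounded to one decimal, which is how the bands are written.

```python
from typing import List, Tuple

# (lower bound, label), checked from the highest band down.
BANDS: List[Tuple[float, str]] = [
    (9.0, "Best in class"),
    (8.0, "Strong pick"),
    (7.0, "Solid option"),
    (6.0, "Conditional"),
]


def band_label(score: float) -> str:
    """Map a 0-10 editorial score to its recommendation label (numeric part only)."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score}")
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return "Not recommended"  # anything below 6.0
```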

3. Tested vs Provisional

A tool is labeled Tested only if we have run it against our reproducible therapy-session set ourselves. Provisional ratings reflect publicly sourced facts and our reading of the product without hands-on clinical testing — directional, not verified. Provisional ratings are capped at 8.5 until tested.
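Here is a minimal sketch of the cap, assuming a hypothetical boolean `tested` flag on each rating. Because 8.5 falls in the 8.0 – 8.9 band, a Provisional tool can rate at most Strong pick until it has been tested.

```python
PROVISIONAL_CAP = 8.5  # ceiling for ratings without hands-on testing


def published_score(raw_score: float, tested: bool) -> float:
    """Apply the Provisional cap: untested tools cannot publish above 8.5."""
    return raw_score if tested else min(raw_score, PROVISIONAL_CAP)


print(published_score(9.3, tested=False))  # 8.5
print(published_score(9.3, tested=True))   # 9.3
```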

4. Evidence rules

Primary sources only
Every pricing, compliance, integration, and feature fact must come from the vendor's own public materials — pricing page, trust center, signed BAA template, security whitepaper, status page, or product documentation. Third-party blog summaries do not count as a source.
Date-stamped and re-verified
Every fact carries a last-verified date. We re-check pricing and compliance facts at least once per quarter and on any visible vendor change. Stale facts are flagged in the UI.
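One way to drive the staleness flag is a simple age check against each fact's last-verified date. This sketch approximates a quarter as 92 days. That threshold, the function name, and the example dates are hypothetical and do not reflect the site's actual configuration.

```python
from datetime import date, timedelta

# Hypothetical threshold: one calendar quarter approximated as 92 days.
MAX_AGE = timedelta(days=92)


def is_stale(last_verified: date, today: date, max_age: timedelta = MAX_AGE) -> bool:
    """True if the fact has gone longer than max_age without re-verification."""
    if last_verified > today:
        raise ValueError("last_verified date is in the future")
    return today - last_verified > max_age


print(is_stale(date(2026, 3, 1), today=date(2026, 6, 15)))  # True (106 days)
print(is_stale(date(2026, 4, 1), today=date(2026, 6, 15)))  # False (75 days)
```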
No guessing, no rounding up
When a vendor does not disclose a fact, we render an em-dash (—) rather than infer. Partial compliance is marked partial, not yes.
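The distinction between "not disclosed" and "no" matters in the data model as well as in the UI. Below is a minimal sketch using a hypothetical three-value compliance enum, where `None` stands for a fact the vendor has not disclosed. With this model, an undisclosed fact can never silently render as Yes or No.

```python
from enum import Enum
from typing import Optional


class Compliance(Enum):
    YES = "Yes"
    PARTIAL = "Partial"  # partial compliance stays partial, never rounded up to YES
    NO = "No"


UNDISCLOSED = "\u2014"  # the em-dash shown when the vendor does not disclose the fact


def render_fact(value: Optional[Compliance]) -> str:
    """Render a compliance fact; None means undisclosed and is never inferred."""
    if value is None:
        return UNDISCLOSED
    return value.value
```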
Reproducible test set
Hands-on testing uses the same fixed set of de-identified mock therapy sessions across every tool — individual CBT intake, couples session with conflict, group DBT skills, EMDR processing, and a crisis disclosure. We rotate the set annually.
Citations on every claim
Each fact on a scribe page links to a numbered source in the per-page Sources & references section. If a claim has no source, it does not appear in the fact table.
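The "no source, no row" rule can be enforced when the fact table is built. The sketch below uses a hypothetical `Fact` record. Facts without a source are dropped, and sources are numbered from 1 in the order they are first cited; that numbering scheme is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Fact:
    label: str
    value: str
    source_url: Optional[str]  # hypothetical field: the vendor primary source, if any


def build_fact_table(facts: List[Fact]) -> Tuple[List[Tuple[str, str, int]], List[str]]:
    """Return (rows, sources); each row is (label, value, source number)."""
    numbers: Dict[str, int] = {}  # source URL -> citation number, in first-cited order
    rows: List[Tuple[str, str, int]] = []
    for fact in facts:
        if not fact.source_url:
            continue  # unsourced claims never reach the fact table
        if fact.source_url not in numbers:
            numbers[fact.source_url] = len(numbers) + 1
        rows.append((fact.label, fact.value, numbers[fact.source_url]))
    return rows, list(numbers)
```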

5. Testing protocol

For every tool labeled Tested, we run the same end-to-end protocol:

Create a fresh account on the lowest paid tier that includes a BAA.
Run five mock sessions from the fixed test set (individual CBT intake, couples conflict, group DBT skills, EMDR processing, crisis disclosure).
Generate notes in SOAP, DAP, BIRP, and GIRP and compare each against a clinician-authored reference note.
Score hallucinations, omissions, mis-attribution, and risk-language calibration on a per-note basis.
Push at least one note into a connected EHR (SimplePractice or TherapyNotes) and measure round-trip time.
Capture screenshots and timestamps; archive everything to a per-tool evidence folder.

6. Independence

Vendors do not see editorial reviews before publication. Reviewers disclose any prior employment with a vendor and recuse from that tool's rating.

7. Verified clinician reviews

Practitioner reviews are email-verified, displayed separately from the editorial score, and never folded into the score itself. We moderate reviews to remove vendor-submitted ones and to verify that each reviewer holds the license they claim. Reviews from unconfirmed email addresses are not displayed.

8. Corrections policy

If a fact on this site is wrong, we want it fixed. Corrections supported by a primary source are applied within five business days. We then update the last-verified date in the page footer and add a brief changelog entry on the affected scribe page.

9. Frequently asked questions

Do vendors see reviews before publication?

No. Vendors do not see editorial reviews before publication. They may submit factual corrections after publication, which are evaluated against primary sources.

What is the difference between Tested and Provisional?

Tested means we have run the tool against our reproducible therapy-session set ourselves. Provisional means the rating is based on publicly sourced facts and product documentation without hands-on clinical testing — directional, not verified.

How do verified clinician reviews affect the editorial score?

They do not. Verified clinician reviews are displayed separately and never folded into the editorial score. We surface both so readers can see where practitioner experience diverges from our editorial view.

How do you handle vendor disputes about a fact?

If a published primary source supports the dispute, we update the fact, the source link, and the last-verified date. If it is not supported, we note the disagreement openly on the page.

How often is the rubric itself revised?

The rubric is reviewed annually and whenever a regulatory change (for example, a new HIPAA enforcement posture or a state-level AI-in-health rule) materially shifts what therapists should require from a scribe.

Want to see the rubric in action? Read our side-by-side comparison or jump to the full ranking.