Methodology
How we score AI therapy scribes — the rubric, the evidence rules, and the independence policy behind every rating on TherapyScribes. Last revised June 15, 2026.
1. Scoring rubric
Six weighted dimensions, totaling 100%. A tool's editorial score is the weighted sum of its dimension scores, mapped to a 0–10 scale. We publish each dimension's contribution on every scribe page.
| Dimension | Weight | What we measure |
|---|---|---|
| Clinical note quality | 35% | Hands-on testing on a set of representative therapy sessions across SOAP, DAP, BIRP and GIRP formats. We assess factual accuracy, speaker attribution on multi-party sessions, risk-language calibration, and rate of hallucinated quotes or fabricated history. |
| Compliance posture | 20% | HIPAA + BAA, SOC 2 Type II, GDPR, and 42 CFR Part 2 awareness for SUD-program use. Audio retention policy, no-training-on-customer-data position, and subprocessor disclosure all factor in. |
| EHR / workflow integration | 15% | Depth of integration into the EHRs therapists actually use — SimplePractice, TherapyNotes, Jane, Alma, Headway, Valant. Native integration > browser extension > copy-paste. |
| Pricing transparency | 10% | Published pricing scores higher than sales-led-only pricing. A free tier and a meaningful trial period also score higher. We penalize fragmented multi-channel pricing and hidden enterprise minimums. |
| Multi-language / format breadth | 10% | Languages of session capture and output; template breadth across therapy modalities (CBT, DBT, EMDR, couples, family, group). |
| Support & roadmap | 10% | Documentation quality, response time, customer-facing roadmap, and operating-history signal. |
| Total | 100% | |
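The weighted sum can be sketched as follows. The weights are taken from the table above; the dimension keys, the function name, and the assumption that each dimension is first scored on a 0–10 scale are illustrative only and are not a description of our internal tooling.

```python
# Weights come from the rubric table; all identifiers are hypothetical.
WEIGHTS = {
    "clinical_note_quality": 0.35,
    "compliance_posture": 0.20,
    "ehr_workflow_integration": 0.15,
    "pricing_transparency": 0.10,
    "language_format_breadth": 0.10,
    "support_roadmap": 0.10,
}


def editorial_score(dimension_scores: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Return (overall score on 0-10, per-dimension contribution).

    Assumption: each dimension has already been scored on a 0-10 scale.
    """
    if set(dimension_scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the six rubric dimensions")
    contributions = {
        name: WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS
    }
    # Round to one decimal, matching the precision used in the score bands.
    return round(sum(contributions.values()), 1), contributions
```

Under these assumptions, a tool scoring 9 on clinical note quality, 8 on compliance, 7 on integration, 6 on pricing, 5 on language breadth, and 8 on support would land at 7.7 overall, with clinical note quality contributing the largest share.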
Clinical note quality — 35%
- Hallucination rate (fabricated quotes, dates, or history) per 100 notes
- Speaker attribution accuracy on couples and family sessions
- Risk-language calibration on suicidality and abuse disclosures
- Adherence to the chosen note format (SOAP / DAP / BIRP / GIRP)
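As a rough illustration of the first metric, the per-100-notes rate is a simple normalization. The function name is hypothetical, and the sketch assumes we count individual fabricated items (a quote, a date, a piece of history) rather than the number of notes that contain at least one.

```python
def hallucination_rate_per_100(notes_reviewed: int, hallucinations_found: int) -> float:
    """Normalize a raw hallucination count to a rate per 100 notes."""
    if notes_reviewed <= 0:
        raise ValueError("notes_reviewed must be positive")
    if hallucinations_found < 0:
        raise ValueError("hallucinations_found cannot be negative")
    return 100 * hallucinations_found / notes_reviewed
```

For example, three fabricated items found across 20 reviewed notes would normalize to a rate of 15 per 100 notes.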
Compliance posture — 20%
- Signed BAA available on the lowest paid tier
- Independent SOC 2 Type II report (not just Type I)
- Default audio retention of 0 seconds or explicit user control
- Published subprocessor list with notification on change
EHR / workflow integration — 15%
- Native two-way sync (note + appointment) vs one-way push
- Coverage of the top six therapy EHRs
- Time-to-first-note from a cold session in minutes
Pricing transparency — 10%
- Per-seat price published on the public site
- Free tier or 14+ day trial without a credit card
- No usage caps that are not stated on the pricing page
Multi-language / format breadth — 10%
- Supported capture languages and output languages (counted separately)
- Built-in templates for CBT, DBT, EMDR, couples, family, and group
- User-editable template library with versioning
Support & roadmap — 10%
- Public changelog updated within the last 60 days
- Median support response under 24 business hours
- Operating history (years shipping the product)
2. Score bands
How the 0–10 editorial score maps to a recommendation.
| Score | Label | What it means |
|---|---|---|
| 9.0 – 10.0 | Best in class | Tested, leading on at least three rubric dimensions, no material compliance gap. |
| 8.0 – 8.9 | Strong pick | Tested or extensively documented, no compliance gap, weak on at most one dimension. |
| 7.0 – 7.9 | Solid option | Meets the bar on clinical quality and compliance, lags on integrations or pricing transparency. |
| 6.0 – 6.9 | Conditional | Use only if a specific feature fits your workflow; one rubric dimension is materially weak. |
| Below 6.0 | Not recommended | Material clinical or compliance gap. We explain the specific failure in the verdict. |
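The band lookup can be expressed as a short threshold table. This is a minimal sketch that assumes the editorial score has already been rounded to one decimal place; the names `BANDS` and `band_label` are hypothetical.

```python
# Lower bound of each band, highest first; labels are from the table above.
BANDS = [
    (9.0, "Best in class"),
    (8.0, "Strong pick"),
    (7.0, "Solid option"),
    (6.0, "Conditional"),
]


def band_label(score: float) -> str:
    """Map a 0-10 editorial score to its recommendation label."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("score must be between 0 and 10")
    for floor, label in BANDS:
        if score >= floor:
            return label
    return "Not recommended"
```

Note that the label alone does not capture the qualitative conditions in the right-hand column (for example, no material compliance gap); those are editorial judgments applied alongside the number.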
3. Tested vs Provisional
A tool is labeled Tested only if we have run it against our reproducible therapy-session set ourselves. Provisional ratings reflect publicly sourced facts and our reading of the product without hands-on clinical testing — directional, not verified. Provisional ratings are capped at 8.5 until tested.
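The cap amounts to a one-line rule. The sketch below is illustrative; `published_score` and `PROVISIONAL_CAP` are hypothetical names, and only the 8.5 value comes from the policy above.

```python
PROVISIONAL_CAP = 8.5  # from the policy above


def published_score(raw_score: float, tested: bool) -> float:
    """Apply the Provisional cap; Tested tools keep their raw score."""
    return raw_score if tested else min(raw_score, PROVISIONAL_CAP)
```

A Provisional tool whose rubric arithmetic comes out at 9.2 would therefore be published at 8.5 until hands-on testing is complete.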
4. Evidence rules
Primary sources only
Every pricing, compliance, integration, and feature fact must come from the vendor's own public materials — pricing page, trust center, signed BAA template, security whitepaper, status page, or product documentation. Third-party blog summaries do not count as a source.
Date-stamped and re-verified
Every fact carries a last-verified date. We re-check pricing and compliance facts at least once per quarter and on any visible vendor change. Stale facts are flagged in the UI.
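A staleness check along these lines could drive the flag. This is a sketch only: the 92-day window is our assumed approximation of one quarter, and `is_stale` is a hypothetical helper, not the site's actual implementation.

```python
from datetime import date, timedelta

# Assumption: 92 days approximates "once per quarter".
REVERIFY_WINDOW = timedelta(days=92)


def is_stale(last_verified: date, today: date) -> bool:
    """True when a fact has gone longer than one window without re-verification."""
    return today - last_verified > REVERIFY_WINDOW
```

A check like this covers only the scheduled quarterly pass; re-verification triggered by a visible vendor change would still need to be handled separately.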
No guessing, no rounding up
When a vendor does not disclose a fact, we render an em-dash (—) rather than infer. Partial compliance is marked partial, not yes.
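In rendering terms, the rule means a missing value is never replaced by a default. A minimal sketch, with `render_fact` as a hypothetical helper and `None` or a blank string standing in for "not disclosed":

```python
from typing import Optional

EM_DASH = "\u2014"  # the placeholder shown for undisclosed facts


def render_fact(value: Optional[str]) -> str:
    """Show the disclosed value as-is, or an em-dash when nothing was disclosed."""
    if value is None or not value.strip():
        return EM_DASH
    return value
```

Because the value is passed through unchanged, a fact recorded as "Partial" is displayed as "Partial" and is never promoted to "Yes".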
Reproducible test set
Hands-on testing uses the same fixed set of de-identified mock therapy sessions across every tool: an individual CBT intake, a couples session with conflict, a group DBT skills session, an EMDR processing session, and a crisis disclosure. We rotate the set annually.
Citations on every claim
Each fact on a scribe page links to a numbered source in the per-page Sources & references section. If a claim has no source, it does not appear in the fact table.
5. Testing protocol
For every tool labeled Tested, we run the same end-to-end protocol:
- Create a fresh account on the lowest paid tier that includes a BAA.
- Run five mock sessions from the fixed test set (individual CBT intake, couples conflict, group DBT skills, EMDR processing, crisis disclosure).
- Generate notes in SOAP, DAP, BIRP, and GIRP and compare against a clinician-authored reference note.
- Score hallucinations, omissions, mis-attribution, and risk-language calibration on a per-note basis.
- Push at least one note into a connected EHR (SimplePractice or TherapyNotes) and measure round-trip time.
- Capture screenshots and timestamps; archive everything to a per-tool evidence folder.
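The protocol above implies a session-by-format grid of notes to score. The sketch below assumes every session is generated in every format (five sessions × four formats = 20 notes per tool); the identifiers and the zero-initialized error counters are hypothetical.

```python
from itertools import product

# Sessions and formats are from the protocol; identifiers are illustrative.
SESSIONS = [
    "individual_cbt_intake",
    "couples_conflict",
    "group_dbt_skills",
    "emdr_processing",
    "crisis_disclosure",
]
FORMATS = ["SOAP", "DAP", "BIRP", "GIRP"]
ERROR_TYPES = ("hallucinations", "omissions", "misattributions", "risk_language_errors")


def build_test_matrix() -> list[dict]:
    """One scoring record per (session, format) pair, with error counts at zero."""
    return [
        {"session": session, "format": fmt, **{err: 0 for err in ERROR_TYPES}}
        for session, fmt in product(SESSIONS, FORMATS)
    ]
```

Each record would then be filled in by a reviewer comparing the generated note against the clinician-authored reference note.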
6. Independence
Vendors do not see editorial reviews before publication. Reviewers disclose any prior employment with a vendor and recuse from that tool's rating.
7. Verified clinician reviews
Practitioner reviews are email-verified, displayed separately from the editorial score, and never folded into the score itself. We moderate to remove vendor-submitted reviews and to verify the reviewer holds the license they claim. Reviews from unconfirmed email addresses are not displayed.
8. Corrections policy
If a fact on this site is wrong, we want it fixed. Corrections supported by a primary source are applied within five business days; we then update the page footer's last-verified date and add a brief changelog entry to the affected scribe page.
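For illustration, a five-business-day deadline can be computed as below. The sketch assumes business days are Monday through Friday, ignores public holidays, and starts counting on the day after the correction is accepted; `correction_deadline` is a hypothetical helper.

```python
from datetime import date, timedelta


def correction_deadline(accepted_on: date, business_days: int = 5) -> date:
    """Return the date that falls `business_days` weekdays after `accepted_on`."""
    current = accepted_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current
```

Under these assumptions, a correction accepted on Monday, June 15, 2026 would be due by the following Monday, June 22.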
9. Frequently asked questions
Do vendors see reviews before publication?
No. Vendors do not see editorial reviews before publication. They may submit factual corrections after publication, which are evaluated against primary sources.
What is the difference between Tested and Provisional?
Tested means we have run the tool against our reproducible therapy-session set ourselves. Provisional means the rating is based on publicly sourced facts and product documentation without hands-on clinical testing — directional, not verified.
How do verified clinician reviews affect the editorial score?
They do not. Verified clinician reviews are displayed separately and never folded into the editorial score. We surface both so readers can see where practitioner experience diverges from our editorial view.
How do you handle vendor disputes about a fact?
If a published primary source supports the dispute, we update the fact, the source link, and the last-verified date. If it is not supported, we note the disagreement openly on the page.
How often is the rubric itself revised?
The rubric is reviewed annually, and whenever a regulatory change (for example, a new HIPAA enforcement posture or a state-level AI-in-health rule) materially shifts what therapists should require from a scribe.