
Methodology

How we score AI therapy scribes — the rubric, the evidence rules, and the independence policy behind every rating on TherapyScribes. Last revised June 15, 2026.

1. Scoring rubric

Six dimensions, with weights totaling 100%. A tool's editorial score is the weighted sum of its dimension scores, mapped to a 0–10 scale. We publish the per-dimension contribution on every scribe page.

Scoring rubric weights for AI therapy scribes
Dimension | Weight | What we measure
Clinical note quality | 35% | Hands-on testing on a set of representative therapy sessions across SOAP, DAP, BIRP and GIRP formats. We assess factual accuracy, speaker attribution on multi-party sessions, risk-language calibration, and rate of hallucinated quotes or fabricated history.
Compliance posture | 20% | HIPAA + BAA, SOC 2 Type II, GDPR, and 42 CFR Part 2 awareness for SUD-program use. Audio retention policy, no-training-on-customer-data position, and subprocessor disclosure all factor in.
EHR / workflow integration | 15% | Depth of integration into the EHRs therapists actually use: SimplePractice, TherapyNotes, Jane, Alma, Headway, Valant. Native integration > browser extension > copy-paste.
Pricing transparency | 10% | Published pricing wins over sales-led-only. Free tier and meaningful trial periods score higher. We penalize fragmented multi-channel pricing or hidden enterprise minimums.
Multi-language / format breadth | 10% | Languages of session capture and output; template breadth across therapy modalities (CBT, DBT, EMDR, couples, family, group).
Support & roadmap | 10% | Documentation quality, response time, customer-facing roadmap, and operating-history signal.
Total | 100% |
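
The weighted sum can be illustrated with a short sketch. It assumes each dimension is first scored on a 0–100 scale and that the mapping to 0–10 is a plain division by 10; neither detail is stated in the rubric, and the dimension keys and example scores are hypothetical.

```python
# Weights from the rubric table, in percent.
WEIGHTS = {
    "clinical_note_quality": 35,
    "compliance_posture": 20,
    "ehr_workflow_integration": 15,
    "pricing_transparency": 10,
    "language_format_breadth": 10,
    "support_roadmap": 10,
}


def contributions(scores):
    """Points each dimension contributes out of 100 (assumes 0-100 inputs)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return {dim: WEIGHTS[dim] * scores[dim] / 100 for dim in WEIGHTS}


def editorial_score(scores):
    """Weighted sum mapped to 0-10 (assumed mapping: divide by 10)."""
    return round(sum(contributions(scores).values()) / 10, 1)


# Hypothetical dimension scores for one tool.
example = {
    "clinical_note_quality": 90,
    "compliance_posture": 80,
    "ehr_workflow_integration": 70,
    "pricing_transparency": 60,
    "language_format_breadth": 50,
    "support_roadmap": 100,
}
print(contributions(example))
print(editorial_score(example))
```

Keeping the weights as integer percentages avoids small floating-point drift when the contributions are summed.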

Clinical note quality 35%

  • Hallucination rate (fabricated quotes, dates, or history) per 100 notes
  • Speaker attribution accuracy on couples and family sessions
  • Risk-language calibration on suicidality and abuse disclosures
  • Adherence to the chosen note format (SOAP / DAP / BIRP / GIRP)

Compliance posture 20%

  • Signed BAA available on the lowest paid tier
  • Independent SOC 2 Type II report (not just Type I)
  • Default audio retention of 0 seconds or explicit user control
  • Published subprocessor list with notification on change

EHR / workflow integration 15%

  • Native two-way sync (note + appointment) vs one-way push
  • Coverage of the top six therapy EHRs
  • Time-to-first-note from a cold session in minutes

Pricing transparency 10%

  • Per-seat price published on the public site
  • Free tier or 14+ day trial without a credit card
  • Every usage cap stated on the pricing page

Multi-language / format breadth 10%

  • Supported capture languages and output languages (counted separately)
  • Built-in templates for CBT, DBT, EMDR, couples, family, and group
  • User-editable template library with versioning

Support & roadmap 10%

  • Public changelog updated within the last 60 days
  • Median support response under 24 business hours
  • Operating history (years shipping the product)

2. Score bands

How the 0–10 editorial score maps to a recommendation.

Editorial score bands
Score | Label | What it means
9.0 – 10.0 | Best in class | Tested, leading on at least three rubric dimensions, no material compliance gap.
8.0 – 8.9 | Strong pick | Tested or extensively documented, no compliance gap, weak on at most one dimension.
7.0 – 7.9 | Solid option | Meets the bar on clinical quality and compliance, lags on integrations or pricing transparency.
6.0 – 6.9 | Conditional | Use only if a specific feature fits your workflow; one rubric dimension is materially weak.
Below 6.0 | Not recommended | Material clinical or compliance gap. We explain the specific failure in the verdict.
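
As a rough illustration, the numeric half of this mapping can be written as a lookup. The sketch assumes scores are rounded to one decimal before banding, which the table's boundaries suggest but do not state. It covers only the score thresholds, not the qualitative conditions in the last column (testing status, compliance gaps).

```python
# Lower bound of each band, highest first, taken from the score bands table.
BANDS = [
    (9.0, "Best in class"),
    (8.0, "Strong pick"),
    (7.0, "Solid option"),
    (6.0, "Conditional"),
]


def band_label(score):
    """Return the band label for a 0-10 editorial score."""
    if not 0 <= score <= 10:
        raise ValueError("editorial score must be between 0 and 10")
    rounded = round(score, 1)  # assumption: bands apply to one-decimal scores
    for floor, label in BANDS:
        if rounded >= floor:
            return label
    return "Not recommended"


print(band_label(8.9))
print(band_label(5.9))
```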

3. Tested vs Provisional

A tool is labeled Tested only if we have run it against our reproducible therapy-session set ourselves. Provisional ratings reflect publicly sourced facts and our reading of the product without hands-on clinical testing — directional, not verified. Provisional ratings are capped at 8.5 until tested.
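
The cap amounts to a one-line rule. A minimal sketch, with hypothetical function and parameter names:

```python
PROVISIONAL_CAP = 8.5  # stated cap for ratings without hands-on testing


def published_score(raw_score, tested):
    """Apply the Provisional cap; Tested scores pass through unchanged."""
    return raw_score if tested else min(raw_score, PROVISIONAL_CAP)


print(published_score(9.2, tested=False))
print(published_score(9.2, tested=True))
```

One consequence is that a Provisional tool can reach the Strong pick band but never Best in class.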

4. Evidence rules

  • Primary sources only

    Every pricing, compliance, integration, and feature fact must come from the vendor's own public materials — pricing page, trust center, signed BAA template, security whitepaper, status page, or product documentation. Third-party blog summaries do not count as a source.

  • Date-stamped and re-verified

    Every fact carries a last-verified date. We re-check pricing and compliance facts at least once per quarter and on any visible vendor change. Stale facts are flagged in the UI.

  • No guessing, no rounding up

    When a vendor does not disclose a fact, we render an em-dash (—) rather than infer. Partial compliance is marked partial, not yes.

  • Reproducible test set

    Hands-on testing uses the same fixed set of de-identified mock therapy sessions across every tool — individual CBT intake, couples session with conflict, group DBT skills, EMDR processing, and a crisis disclosure. We rotate the set annually.

  • Citations on every claim

    Each fact on a scribe page links to a numbered source in the per-page Sources & references section. If a claim has no source, it does not appear in the fact table.
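
Two of these rules, date-stamping and "no guessing", lend themselves to a small sketch. The `Fact` class, its field names, and the sample values are hypothetical, and "once per quarter" is approximated here as 92 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption: "once per quarter" is approximated as 92 days.
RECHECK_INTERVAL = timedelta(days=92)
UNDISCLOSED = "\u2014"  # the em-dash rendered when a vendor does not disclose


@dataclass(frozen=True)
class Fact:
    label: str
    value: Optional[str]  # None means the vendor does not disclose it
    last_verified: date

    def is_stale(self, today: date) -> bool:
        """True when the fact is overdue for re-verification."""
        return today - self.last_verified > RECHECK_INTERVAL

    def display(self) -> str:
        """Render the value, or the em-dash placeholder instead of a guess."""
        return UNDISCLOSED if self.value is None else self.value


# Made-up facts for illustration only.
price = Fact("Per-seat price", None, date(2026, 1, 5))
baa = Fact("BAA on lowest paid tier", "partial", date(2026, 5, 1))
today = date(2026, 6, 15)

for fact in (price, baa):
    flag = " (stale)" if fact.is_stale(today) else ""
    print(f"{fact.label}: {fact.display()}{flag}")
```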

5. Testing protocol

For every tool labeled Tested, we run the same end-to-end protocol:

  1. Create a fresh account on the lowest paid tier that includes a BAA.
  2. Run five mock sessions from the fixed test set (individual CBT intake, couples conflict, group DBT skills, EMDR processing, crisis disclosure).
  3. Generate notes in SOAP, DAP, BIRP, and GIRP and compare against a clinician-authored reference note.
  4. Score hallucinations, omissions, mis-attribution, and risk-language calibration on a per-note basis.
  5. Push at least one note into a connected EHR (SimplePractice or TherapyNotes) and measure round-trip time.
  6. Capture screenshots and timestamps; archive everything to a per-tool evidence folder.
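
The per-note scoring in step 4 can be rolled up into a rate per 100 notes, as in the sketch below. It assumes every session is generated in every format (5 sessions × 4 formats = 20 notes) and that the rate counts total findings rather than notes with at least one finding. The record type and the finding counts are made up for illustration.

```python
from dataclasses import dataclass
from itertools import product

SESSIONS = [
    "individual CBT intake",
    "couples conflict",
    "group DBT skills",
    "EMDR processing",
    "crisis disclosure",
]
FORMATS = ["SOAP", "DAP", "BIRP", "GIRP"]


@dataclass
class NoteFindings:
    """Hypothetical per-note tally of the four finding types in step 4."""
    session: str
    note_format: str
    hallucinations: int = 0
    omissions: int = 0
    misattributions: int = 0
    risk_calibration_errors: int = 0


def rate_per_100_notes(findings, field):
    """Total count of one finding type, normalized to 100 notes."""
    if not findings:
        raise ValueError("no notes scored")
    return 100 * sum(getattr(n, field) for n in findings) / len(findings)


# One record per session/format pair, then some made-up findings.
notes = [NoteFindings(s, f) for s, f in product(SESSIONS, FORMATS)]
notes[0].hallucinations = 2
notes[7].hallucinations = 1

hallucination_rate = rate_per_100_notes(notes, "hallucinations")
print(len(notes), hallucination_rate)
```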

6. Independence

Vendors do not see editorial reviews before publication. Reviewers disclose any prior employment with a vendor and recuse from that tool's rating.

7. Verified clinician reviews

Practitioner reviews are email-verified, displayed separately from the editorial score, and never folded into the score itself. We moderate to remove vendor-submitted reviews and to verify the reviewer holds the license they claim. Reviews from unconfirmed email addresses are not displayed.

8. Corrections policy

If a fact on this site is wrong, we want it fixed. Supported corrections are applied within five business days; the page footer's last-verified date is updated and a brief changelog entry is added on the affected scribe page.
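
For illustration, the five-business-day window can be computed as below. The sketch treats Monday through Friday as business days and ignores public holidays, which the policy does not address. The function name is hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start, days):
    """Return the date `days` business days after `start` (weekends skipped)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# A correction accepted on Monday, June 15, 2026 is due the following Monday.
print(add_business_days(date(2026, 6, 15), 5))
```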

9. Frequently asked questions

Do vendors see reviews before publication?

No. Vendors do not see editorial reviews before publication. They may submit factual corrections after publication, which are evaluated against primary sources.

What is the difference between Tested and Provisional?

Tested means we have run the tool against our reproducible therapy-session set ourselves. Provisional means the rating is based on publicly sourced facts and product documentation without hands-on clinical testing — directional, not verified.

How do verified clinician reviews affect the editorial score?

They do not. Verified clinician reviews are displayed separately and never folded into the editorial score. We surface both so readers can see where practitioner experience diverges from our editorial view.

How do you handle vendor disputes about a fact?

If a published primary source supports the vendor's position, we update the fact, the source link, and the last-verified date. If no primary source supports it, we note the disagreement openly on the page.

How often is the rubric itself revised?

The rubric is reviewed annually, and whenever a regulatory change (for example, a new HIPAA enforcement posture or a state-level AI-in-health rule) materially shifts what therapists should require from a scribe.

Want to see the rubric in action? Read our side-by-side comparison or jump to the full ranking.