digital-health-clinical-asr-eval

Stage 3 of Clinical ASR Flywheel. Score a NeMo manifest, produce the five-section KER leaderboard (by-ipa_source diagnostic). Not for ASR auth (/riva-asr).

---
name: "digital-health-clinical-asr-eval"
description: "Stage 3 of Clinical ASR Flywheel. Score a NeMo manifest, produce the five-section KER leaderboard (by-ipa_source diagnostic). Not for ASR auth (/riva-asr)."
version: "1.1.0"
author: "Ben Randoing <brandoing@nvidia.com>"
tags:
  - clinical-asr
  - eval
  - ker
  - leaderboard
  - flywheel
tools:
  - Read
  - Write
  - Bash
  - Skill
license: Apache-2.0
compatibility: "NVIDIA_API_KEY (required) for hosted ASR NIMs via NVCF. A NeMo-format manifest produced by /digital-health-clinical-asr-build (or an externally-provided manifest carrying the clinical-extension fields). All ASR call shapes and WER/CER/KER/SER scoring recipes are inlined — no sibling agent skill required."
metadata:
  author: "Ben Randoing <brandoing@nvidia.com>"
  tags:
    - clinical-asr
    - flywheel
    - eval
    - ker
    - leaderboard
  team: healthcare-tme
  domain: ai-ml
  stage: 3
  previous_skill: digital-health-clinical-asr-build
  next_skill: digital-health-clinical-asr-finetune
---

<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Clinical ASR Flywheel — Stage 3 (Eval)

> **⚠ Agent: read the Critical Workflow Rules section below before answering.** This SKILL.md is self-contained — `evals/`, `references/`, and `assets/` are pointers, not load-bearing. Answer methodology questions from this file directly; only invoke tools when the user explicitly asks to execute against a real manifest.

You are the **score-and-route** stage. The user arrives with a NeMo-format `manifest.jsonl` (either from `/digital-health-clinical-asr-build` or carried in from elsewhere). You transcribe it via the chosen ASR NIM, score four metrics, produce a five-section leaderboard, and read the decision tree to decide whether the user should advance to `/digital-health-clinical-asr-finetune`, loop back to `/digital-health-clinical-asr-build`, or stop and harden the eval.

**This skill does not generate audio.** If the manifest is missing or empty, send the user back to `/digital-health-clinical-asr-build`.

## Audio leaves your environment — disclose this to the user before any clip is sent

This stage transmits each manifest row's WAV file plus its reference text to an external NVIDIA service. Surface this before invoking the first ASR call:

| Service | What gets sent | When |
|---|---|---|
| **NVIDIA NVCF Parakeet/Nemotron ASR** (`grpc.nvcf.nvidia.com`) | Every audio clip referenced by the manifest (raw PCM bytes), plus the reference transcript and the clinical-extension metadata for scoring | Step 3b, one call per manifest row |

The clips should be **synthetic audio generated by Stage 2** (Magpie TTS over a user-curated term list), not real patient audio. **Do not pass real clinical recordings, real patient encounters, or any PHI through this skill.** Scoring then runs locally (pure-Python WER/CER/KER/SER, or `jiwer` if installed); only the ASR step transmits anything.

## Critical workflow rules (apply on every activation)

For methodology questions (leaderboard structure, KER definition, decision tree), answer from this file. Don't invoke tools, call other skills, or run scripts unless the user explicitly asks to execute against a real manifest. Surface these facts in any response:

1. **Off-ramp first.** If the user is asking about something outside scoring, route and stop without running any workflow:
   - ASR model-catalog selection / comparison / alternative NIMs → `/riva-asr`
   - ASR auth (API keys, bearer tokens, function IDs) → `/riva-asr`
   - ASR gRPC protocol, streaming, batching, chunking, retries → `/riva-asr`
   - NIM deploy / `riva-build` / `riva-deploy` → `/riva-asr-custom`
   - NGC / Docker / NVIDIA Container Toolkit → `/riva-nim-setup`
   - No manifest yet → `/digital-health-clinical-asr-build`
   - Wants to fine-tune now with a known KER → `/digital-health-clinical-asr-finetune`
2. **Default ASR NIM is `nvidia/parakeet-tdt-0.6b-v2`** (NVCF function-id `d3fe9151-442b-4204-a70d-5fcc597fd610`, offline gRPC). Env-var overrides: `ASR_MODEL_NAME` (leaderboard display name), `ASR_NVCF_FUNCTION_ID` (swap to a different hosted NIM — e.g. Whisper Large v3 `b702f636-…` while the Parakeet backend is faulting, or a fine-tuned NIM), `ASR_ENDPOINT` (self-hosted gRPC; takes precedence). Echo the chosen NIM **and the resolved function-id** back before spending API credits.
3. **ASR transcription is inlined in Step 3b** (NVCF gRPC + `riva.client.ASRService.offline_recognize`, same auth pattern as Stage 1). For deeper protocol/auth questions, alternative NIM catalogs, or self-hosted Riva NIM configuration, defer to `/riva-asr`.
4. **KER is the headline.** Per-row check: the flagged `term` words must appear *in order and contiguous* in the normalized hypothesis, so `cefazolin → cefa zolin` is a miss. Aggregate WER hides clinically dangerous failures; both are reported, but KER is the gate.
5. **The by-`ipa_source` split is the most informative single number** in the leaderboard. The `merriam-webster` vs `magpie_g2p` delta proves the SSML override pipeline is doing real work. Read it aloud to the user.
6. **Special-case routing.** If KER is low on `merriam-webster` rows but high on `magpie_g2p` rows, that is a pronunciation-coverage gap, **not** a model gap. Route back to `/digital-health-clinical-asr-build` Step 2d. **Do NOT recommend `/digital-health-clinical-asr-finetune`** as a first response.
7. **Five-section leaderboard order.** Headline (WER/CER/KER/SER) → KER by `entity_category` → KER by `ipa_source` → KER by `noise_level` → Per-term KER worst-first. The by-`ipa_source` section is mandatory; it is the proof the SSML pipeline works.

## Purpose

Score a clinical-ASR manifest, produce a five-section KER leaderboard, and route the user via the post-eval decision tree. Methodology details (metric definitions, normalization, leaderboard order, special-case routing) live in Critical Workflow Rules above and Instructions below.

## When to use this skill

Activate on user phrases like:

- "Score my ASR manifest"
- "What's the KER on Parakeet TDT v2?"
- "Run the eval on cycle-N"
- "Compare two ASR models on the clinical benchmark"
- "Generate the leaderboard"
- "I have a manifest.jsonl, how do I score it?"
- "Why is KER 0.4 when WER is 0.07?"
- "Should we fine-tune?" *(this is the eval-side question — the post-eval decision tree lives in this skill)*

**Literal-keyword non-activation check** — if the user's message contains any of `authenticate`, `API key`, `bearer`, `function ID`, `gRPC`, `streaming`, `chunking`, `batching`, `transcription retry`, `riva-build`, `riva-deploy`, `NIM deploy`, `NGC`, `Docker`, `Container Toolkit`, or asks "which ASR model is best" / "compare models" / "vendor differences" — **do NOT activate** the scoring workflow. Apply Critical Workflow Rule #1 above to route to the right sibling skill and stop. This applies even if the user mentions "KER" or "eval" alongside the keyword.

## Prerequisites

- **A NeMo-format manifest** with the clinical extension fields (`term`, `entity_category`, `ipa_source`, `voice_id`, `noise_level`, `context_type`). The schema is documented in the build skill's `references/manifest-schema.md`.
- **`NVIDIA_API_KEY`** exported (Stage 1 prerequisite still applies).
- **`nvidia-riva-client` + `soundfile`** installed (Stage 1 prerequisite). For self-hosted Riva NIM details, see `/riva-asr` Option B.
- **Audio files actually present on disk** — run the audio-existence pre-flight from the manifest-schema reference before spending API credits.
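The audio-existence pre-flight itself is documented in the build skill's manifest-schema reference, which is not reproduced here. As a rough sketch of what such a check could look like, assuming a JSONL manifest with one object per line and an `audio_filepath` key (the function name `missing_audio` and the relative-path handling are hypothetical, not taken from the reference):

```python
import json
from pathlib import Path


def missing_audio(manifest_path):
    """Return the audio_filepath values in a JSONL manifest that are not on disk."""
    manifest = Path(manifest_path)
    missing = []
    with manifest.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            audio = Path(row["audio_filepath"])
            # Assumption: relative paths are resolved against the manifest's directory.
            if not audio.is_absolute():
                audio = manifest.parent / audio
            if not audio.is_file():
                missing.append(row["audio_filepath"])
    return missing
```

An empty return value means every referenced clip was found; anything else is worth fixing before the first ASR call.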

## Instructions

### 3a. Pick the ASR NIM

**Default**: `nvidia/parakeet-tdt-0.6b-v2` via NVCF gRPC (offline), function-id `d3fe9151-442b-4204-a70d-5fcc597fd610`. This is NVIDIA's current English ASR recommendation: the fastest and cheapest model in the catalog, and supported in NeMo's stock SFT recipe, so the Stage 3 baseline and a Stage 4 fine-tune use the same model family.

Three runtime env-var overrides are available: `ASR_MODEL_NAME` (leaderboard display name), `ASR_NVCF_FUNCTION_ID` (swap to a different hosted NIM), and `ASR_ENDPOINT` (self-hosted gRPC). The full alternate-NIM catalog (Parakeet TDT 1.1B, Parakeet CTC 1.1B, Whisper Large v3, Nemotron streaming), with function IDs and call-shape notes, is in `references/offline-asr-recipe.md`.

Echo the chosen NIM, the resolved function-id, and any env-var overrides to the user **before** spending API credits. A 200-row manifest on hosted Parakeet TDT v2 is cheap; an accidental run against the wrong model on a 1,000-row manifest is not.
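The override precedence described above can be sketched as follows. This is an illustration only, not the `resolve_asr_config` helper from `references/offline-asr-recipe.md`: the names `resolve_overrides` and `describe` are hypothetical, the `:443` port is an assumption, and treating `ASR_ENDPOINT` as disabling the NVCF function-id is one reading of "takes precedence".

```python
import os

DEFAULT_MODEL = "nvidia/parakeet-tdt-0.6b-v2"
DEFAULT_FUNCTION_ID = "d3fe9151-442b-4204-a70d-5fcc597fd610"
NVCF_URI = "grpc.nvcf.nvidia.com:443"  # host from this skill; the port is an assumption


def resolve_overrides(env=None):
    """Resolve the three env-var overrides into one config dict."""
    env = os.environ if env is None else env
    endpoint = env.get("ASR_ENDPOINT")
    return {
        "model_name": env.get("ASR_MODEL_NAME", DEFAULT_MODEL),
        "uri": endpoint or NVCF_URI,
        # Assumption: a self-hosted endpoint needs no NVCF function-id.
        "function_id": None if endpoint else env.get(
            "ASR_NVCF_FUNCTION_ID", DEFAULT_FUNCTION_ID
        ),
    }


def describe(cfg):
    """One-line summary to echo to the user before any API credits are spent."""
    target = cfg["function_id"] or "self-hosted"
    return f"ASR NIM: {cfg['model_name']} | function-id: {target} | uri: {cfg['uri']}"
```

Calling `describe(resolve_overrides())` before the first request gives the user a chance to catch a wrong function-id while it is still free to fix.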

### 3b. Transcribe

For each row in `manifest.jsonl`, transcribe `audio_filepath` and write `per_sample.json` (one JSON object per row, JSONL or a JSON array — caller's choice):

```json
{
  "audio_filepath": "...",
  "ref": "<row.text>",
  "hyp": "<asr output>",
  "term": "<row.term>",
  "entity_category": "<row.entity_category>",
  "ipa_source": "<row.ipa_source>",
  "voice_id": "<row.voice_id>",
  "noise_level": "<row.noise_level>",
  "context_type": "<row.context_type>"
}
```

**Recipe** (full Python in `references/offline-asr-recipe.md`): `transcribe_manifest(api_key, manifest_path, out_path, language_code="en-US")` opens a gRPC channel to NVCF (or to `ASR_ENDPOINT` if set for self-hosted Riva), calls `riva.client.ASRService.offline_recognize` once per row, and writes one record per row in the shape above. Sentences in a clinical manifest are ≤ 30 s, so no streaming or batching is needed. The call uses the same `auth_for` shape as the Stage 1 setup smoke test. The agent harness passes `api_key` explicitly; the recipe reads the three env-var overrides (`ASR_NVCF_FUNCTION_ID`, `ASR_MODEL_NAME`, `ASR_ENDPOINT`) at the top so auditors see the knobs in one place.

**Whisper fallback** (when Parakeet's NVCF backend faults with `CUDA illegal-memory-access` from Triton) and **self-hosted Riva NIM** (`ASR_ENDPOINT=localhost:50051`) env-var patterns: see `references/offline-asr-recipe.md` (§Whisper fallback, §Self-hosted Riva NIM).

**Resilience knobs deferred to the user.** If NVCF returns `RESOURCE_EXHAUSTED` mid-batch, the loop raises on that row; re-run from the failing row. Streaming/batching/retry-with-backoff are out of scope — see `/riva-asr`.
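Re-running from the failing row is easier if completed rows are already on disk. The sketch below shows one way to do that, assuming JSONL output and an unchanged manifest order between runs. It is not the `transcribe_manifest` recipe: `transcribe_resumable` is a hypothetical name, and `transcribe_fn` is a stand-in for whatever callable wraps the real ASR request.

```python
import json
from pathlib import Path

FIELDS = ("term", "entity_category", "ipa_source",
          "voice_id", "noise_level", "context_type")


def transcribe_resumable(manifest_path, out_path, transcribe_fn):
    """Append one record per manifest row, skipping rows written by an earlier run."""
    out = Path(out_path)
    done = 0
    if out.exists():
        with out.open(encoding="utf-8") as fh:
            done = sum(1 for line in fh if line.strip())
    written = 0
    with open(manifest_path, encoding="utf-8") as src, \
            out.open("a", encoding="utf-8") as dst:
        rows = (json.loads(line) for line in src if line.strip())
        for index, row in enumerate(rows):
            if index < done:
                continue  # already transcribed in an earlier run
            hyp = transcribe_fn(row["audio_filepath"])  # may raise mid-batch
            record = {"audio_filepath": row["audio_filepath"],
                      "ref": row["text"], "hyp": hyp}
            record.update({key: row.get(key) for key in FIELDS})
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            dst.flush()  # keep completed rows on disk if a later row fails
            written += 1
    return written
```

If a row raises, the exception propagates as described above; calling the function again with the same arguments picks up at the first row that has no record yet.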

### 3c. Score four metrics

For every row, compute:

| Metric | What it measures | Why we keep it |
|---|---|---|
| **WER** | Word error rate (Levenshtein on tokens, after normalization) | Industry standard; blunt instrument for clinical |
| **CER** | Character error rate | Catches near-misses on long compound names |
| **KER** ★ | Keyword error rate: per row, 0 if the flagged `term` appears in the hypothesis (normalized, **contiguous** match), 1 if it does not | **Headline clinical signal** |
| **SER** | Sentence error rate (1 if the hypothesis differs from the reference at all, 0 if it matches exactly) | Sanity bound; what the doctor experiences |

**Normalization (apply to both `ref` and `hyp` before all four metrics):**

1. Lowercase.
2. NFKD-normalize (ligatures, full-width forms, and other compatibility characters → their plain equivalents).
3. Strip punctuation **except hyphen**.
4. Collapse whitespace runs to a single space.
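The four steps can be sketched in a few lines of standard-library Python. The canonical `normalize` lives in `references/scoring-recipes.md`; this version is an independent illustration, and three choices in it are assumptions: punctuation means any Unicode category starting with `P`, punctuation is deleted rather than replaced by a space, and leading/trailing whitespace is trimmed.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, NFKD-normalize, strip punctuation except hyphen, collapse whitespace."""
    text = text.lower()
    text = unicodedata.normalize("NFKD", text)
    # Assumption: drop every Unicode punctuation character except ASCII hyphen-minus.
    text = "".join(
        ch for ch in text
        if ch == "-" or not unicodedata.category(ch).startswith("P")
    )
    # Collapse whitespace runs; trimming the ends is an extra assumption.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize("  Cefazolin,  2 g IV   STAT! ")` returns `"cefazolin 2 g iv stat"`. Deleting punctuation also removes decimal points (`0.5` becomes `05`), which is harmless as long as the same function is applied to both `ref` and `hyp`.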

**Inline scoring recipes** — `normalize` / `edit_distance` / `wer` / `cer` / `ker` / `ser` (pure-Python, no `jiwer` dependency): see `references/scoring-recipes.md`. Aggregate across rows by taking `mean(per-row score)` for each metric.

**Strict KER** — term words must appear *in order, adjacent* in the normalized hypothesis. This is conservative: `cefazolin → cefa zolin` counts as a miss. That's the right call clinically — a downstream pharmacy lookup will fail on the misspelled token.

KER does **not** punish surrounding errors. A row where the term is correct and the rest of the sentence is garbage still scores KER=0; the WER on that row will surface the broader problem separately.
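The per-row metrics can be sketched as below. This is an illustrative pure-Python version, not the code in `references/scoring-recipes.md`; it assumes `ref`, `hyp`, and `term` have already been through the four normalization steps, and it matches KER on whole whitespace-separated tokens.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def cer(ref, hyp):
    return edit_distance(ref, hyp) / max(len(ref), 1)


def ker(term, hyp):
    """0.0 if the term's words appear in order and contiguous in hyp, else 1.0."""
    needle, hay = term.split(), hyp.split()
    n = len(needle)
    hit = n > 0 and any(hay[i:i + n] == needle for i in range(len(hay) - n + 1))
    return 0.0 if hit else 1.0


def ser(ref, hyp):
    return 0.0 if ref == hyp else 1.0
```

With `ref = "start cefazolin two grams now"` and `hyp = "start cefa zolin two grams now"`, `wer` is 0.4 (one substitution plus one insertion over five reference words) and `ker("cefazolin", hyp)` is 1.0, which is the split-token miss described above. A hypothesis such as `"xx cefazolin yy"` still scores `ker` 0.0 because only the term is checked.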

### 3d. Breakdowns + leaderboard

Write a five-section markdown leaderboard, **in this order**:

1. **Headline** — overall WER, CER, KER, SER for the chosen model.
2. **KER by `entity_category`** — drug vs procedure vs anatomy vs ... This is what the user actually cares about for deployment.
3. **KER by `ipa_source`** — **the most informative single number in the leaderboard.** The delta between `merriam-webster` and `magpie_g2p` rows is the proof the SSML override pipeline is doing real work. *Read this section aloud to the user.*
4. **KER by `noise_level`** — clinical environments are loud. `snr_5db` rows are closer to reality than `clean`.
5. **Per-term KER** (worst first) — these are your Stage 4 fine-tune targets.

A representative `ipa_source` split, with an interpretation of the `merriam-webster` vs `magpie_g2p` delta, is in `references/scoring-recipes.md` §Representative ipa_source split. The delta tells the deployment story: if the user sees a wide gap and asks "should we fine-tune?", the answer is *not yet*; route them back to `/digital-health-clinical-asr-build`'s IPA QA pipeline (Step 2d). See the decision tree below.
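The three KER breakdowns and the worst-first term list all reduce to grouping per-row scores and taking the mean. A minimal sketch, assuming each scored row is a dict that carries a numeric `ker` value alongside the manifest fields (the helper names `ker_by` and `worst_first` are hypothetical):

```python
from collections import defaultdict
from statistics import mean


def ker_by(rows, key):
    """Mean per-row KER for each distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["ker"])
    return {name: mean(scores) for name, scores in groups.items()}


def worst_first(rows, key="term"):
    """(value, mean KER) pairs sorted by descending KER, ties broken by name."""
    return sorted(ker_by(rows, key).items(), key=lambda kv: (-kv[1], kv[0]))
```

`ker_by(rows, "ipa_source")` gives the section-3 numbers, and subtracting the `merriam-webster` value from the `magpie_g2p` value gives the delta to read aloud; `worst_first(rows)` gives section 5.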

## Decision tree (after eval)

Read the **priority-category KER** (drug KER for most clinical workflows, procedure KER for surgical workflows) and route:

| KER on priority category | Recommend |
|---|---|
| **> 0.3** | `/digital-health-clinical-asr-finetune`. Manifest is already NeMo-format-ready. Note: 100 rows is the minimum for a believable fine-tune signal; if the manifest is smaller, grow it first via `/digital-health-clinical-asr-build`. |
| **0.1 – 0.3** | Either expand the term list (back to `/digital-health-clinical-asr-build` with new domain terms — usually surfaces more failures cheaper than tuning) **or** fine-tune. On a *first* eval, expand. On a *later* eval where you've already grown the manifest, tune. |
| **< 0.1** | Strong baseline. Don't tune yet — you'd be optimizing against a saturated metric. Push the eval harder: add voices, noise levels, contexts, adversarial terms. Loop back to `/digital-health-clinical-asr-build`. |

**Special case: KER is low on `merriam-webster` rows but high on `magpie_g2p` rows.** That's a pronunciation-hint coverage gap, **not a model gap**. Route back to `/digital-health-clinical-asr-build` Step 2d (IPA QA review), not to `/digital-health-clinical-asr-finetune`. Fine-tuning over a TTS-pronunciation gap teaches the ASR model to recognize the TTS engine's mispronunciations rather than the real term, which is the wrong fix.
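The table and the special case can be written down as a small routing function. This is a sketch of the rules as stated, with a few assumptions: the function name and arguments are hypothetical, a KER of exactly 0.1 or 0.3 falls in the middle band, and whether a pronunciation gap exists is passed in as a judgment (`pronunciation_gap`) because this skill gives no numeric threshold for it.

```python
BUILD = "/digital-health-clinical-asr-build"
FINETUNE = "/digital-health-clinical-asr-finetune"


def route(priority_ker, n_rows, first_eval=True, pronunciation_gap=False):
    """Return (next skill, reason) from the priority-category KER."""
    if pronunciation_gap:
        # Special case wins over the thresholds: fix pronunciation coverage first.
        return BUILD, "pronunciation-coverage gap: IPA QA review in Step 2d"
    if priority_ker > 0.3:
        if n_rows < 100:
            return BUILD, "grow the manifest to at least 100 rows before tuning"
        return FINETUNE, "KER above 0.3 on the priority category"
    if priority_ker >= 0.1:
        if first_eval:
            return BUILD, "first eval: expand the term list"
        return FINETUNE, "later eval on an already-grown manifest"
    return BUILD, "strong baseline: harden the eval before tuning"
```

For instance, `route(0.45, n_rows=200)` points to the fine-tune skill, while `route(0.40, n_rows=200, pronunciation_gap=True)` points back to the build skill even though the KER is high.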

## Examples

**Scenario A — first eval on a fresh cycle-1 manifest.** User: *"I have `manifest.jsonl` with 200 clinical audio rows already, with `term` and `entity_category` fields. How do I score it?"* → Skip Stage 2 entirely. Run the audio-existence pre-flight. Pick `parakeet-tdt-0.6b-v2` (default) and echo the choice + resolved function-id. Run the inlined Step 3b recipe (`transcribe_manifest(...)`). Score the four metrics. Produce the five-section leaderboard. Read the by-`ipa_source` split to the user. Apply the decision tree against drug KER.

**Scenario B — interpreting a mixed result.** User: *"Eval shows KER 0.05 on rows tagged `merriam-webster` but 0.40 on rows tagged `magpie_g2p`. Should I fine-tune?"* → No — this is the special case. The model is fine; the pronunciation hints aren't covering the long-tail terms. Route the user back to `/digital-health-clinical-asr-build` Step 2d to audition the `magpie_g2p` rows and append verified IPA to `pronunciation_overrides.csv`. Re-run Stage 3 after the rebuild before reconsidering Stage 4.

## Artifacts produced

- `per_sample.json` — per-row transcription results with all clinical-extension fields preserved (the ASR `hyp` joined to the manifest's `ref` and metadata)
- `results.csv` — per-row WER/CER/KER/SER scores
- `leaderboard_cycle<N>.md` — five-section markdown report

(File names are user-chosen; the names above are conventions the rest of this skill assumes.)

## Troubleshooting

- **"No manifest found"** → user skipped Stage 2. Route to `/digital-health-clinical-asr-build` or confirm `$MANIFEST_PATH`.
- **All rows KER=1** → normalization mismatch between `ref` and `hyp`. Apply the four normalization steps to both sides.
- **All rows KER=0 but WER high** → likely misaligned manifest (audio row mismatch). Spot-check a few `(ref, hyp)` pairs by hand.
- **KER low on `merriam-webster` rows, high on `magpie_g2p` rows** → pronunciation-coverage gap. Route to `/digital-health-clinical-asr-build` Step 2d. **Don't fine-tune**; the model isn't the problem.
- **KER high on both `merriam-webster` and `magpie_g2p` rows** → real model gap. Stage 4 is the right route (manifest ≥ 100 rows).
- **`clean` rows fine, `snr_5db` balloons** → robustness gap; expand noise diversity via `/digital-health-clinical-asr-build`.
- **Riva-NIM and offline NeMo results diverge** → Riva preprocessing / `riva-build` flags. Route to `/riva-asr-custom`.
- **`RESOURCE_EXHAUSTED` on large manifests** → retry after 30 s; slice + re-run dropped rows. Built-in backoff: `/riva-asr`.
- **`Auth.__init__() got 'ssl_cert'`** / **CUDA illegal-memory-access on Parakeet function ID**: see `references/offline-asr-recipe.md` (ssl_root_cert rename + §Whisper fallback).

Anything else: identify the upstream owner. ASR protocol → `/riva-asr`. NIM deploy → `/riva-asr-custom`. Scoring → here.

## Limitations

- **English-only by default.** Tokenization + normalization assume Latin script and en-US lexicon.
- **Strict-contiguous KER is conservative.** A near-miss like `cefa zolin` counts as a miss. That's intentional — pharmacy lookups fail on near-misses. Users wanting "soft" matching can switch to phoneme-level edit distance, which is a methodology extension, not a config tweak.
- **One model per eval run.** Comparing two models means running the eval twice and diffing the two `leaderboard_cycle<N>.md` files (or extending the recipe to write multi-model rows yourself).
- **Hosted-only paths assumed.** Self-hosted NIMs work but require `/riva-nim-setup` first.

## Next steps

- **Forward (KER > 0.3, manifest ≥ 100 rows):** `/digital-health-clinical-asr-finetune`.
- **Back to build (KER 0.1–0.3 on first eval, or `magpie_g2p` gap):** `/digital-health-clinical-asr-build`.
- **Stop (KER < 0.1):** the eval is saturated. Harden it before declaring victory.
- **Lateral** for ASR protocol / auth / streaming / self-hosted NIM details: `/riva-asr`.

## References

- [`references/offline-asr-recipe.md`](references/offline-asr-recipe.md) — full Step 3b Python recipe (`transcribe_manifest`, `resolve_asr_config`, `build_asr_auth`), function-ID catalog with call-shape notes, Whisper fallback, self-hosted Riva NIM setup
- [`references/scoring-recipes.md`](references/scoring-recipes.md) — pure-Python WER/CER/KER/SER scoring functions with the canonical 4-step normalization
Source

Creator's repository · nvidia/skills
License: Apache-2.0