explainer

Skill file

Preview skill file
---
name: explainer
description: ~60-80s explainer video for any URL — GitHub repo, product page, docs site, blog post, or launch. Canonical workflow for URL walkthroughs. Use when the user asks to "explain this URL / repo / website / product", "make a walkthrough video for [url]", "demo this site", "Loom-style explainer of [url]", "explainer for github.com/...", or "explain this product link". Drives a real browser through the URL, generates an avatar lipsync, and composites in a 1280×800 macOS Sonoma frame with a 246-pixel bottom-left avatar circle. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); other URLs use a generic page-walkthrough flow.
argument-hint: <url> [--focus angles] [--avatar url] [--voice id] [--live-url url] [--lipsync-provider pika|kling] [--no-captions] [--preview]
---

# /pika:explainer

Generate a ~60–80s URL explainer video: drive a real browser through the URL along a beat-sheet timeline, generate an avatar lipsync of the narration, and composite it all in a 1280×800 macOS Sonoma frame with a 240-pixel inner avatar (246-pixel outer including 3px white stroke ring) at canvas (20, 476) and element-targeted zoom on every mid-section beat. Works on any URL — product pages, docs sites, blog posts, launches. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); all other URLs use a generic page-walkthrough flow.

**Usage:** `/pika:explainer <url> [--focus "angles"] [--avatar <url>] [--voice <id>] [--lipsync-provider pika|kling] [--preview] [--live-url <url>]`

## Behavior

### Defaults — fire fast, no mid-flow confirmation

- **Use identity-store defaults silently for avatar / voice.** Never ask "should I use your avatar?" or "which voice?" before firing. Honor explicit overrides (`--avatar`, `--voice`) when supplied; otherwise resolve via `identity_avatar_url` / `identity_voice_id` and proceed. See Step 1 for the full resolution waterfall (including the silent fallback when identity returns null).
- **No mid-flow "type yes to proceed" gates by default.** Step 5 preview is **opt-in via `--preview`** (for power users testing new avatar/voice combos before the long-pole render); the default flow runs end-to-end without pausing.
- **Do not solicit `--focus` either.** Make a confident first attempt from page structure; users re-run with `--focus "X"` if the angle missed.

These defaults match industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return. Account credit balance + provider failover (Step 9) are the canonical guardrails.

### Local avatar images on Claude Desktop

Claude Desktop can't pass inline-pasted images to MCP tools yet (Anthropic-side limitation). If the user pastes a photo inline, or mentions a local file they want as `--avatar`, pause Step 1 and kindly send them this — something like:

> Heads up — pasted images don't reach MCP tools on Claude Desktop yet (Anthropic limitation). Two easy options for your avatar:
>
> - **Paste a URL** if it's already hosted (Imgur, S3, your site) — fastest
> - **Attach the image file** so I can upload it before generation.

When a local file arrives, convert it to a public URL with `upload_asset` and use the returned `public_url` as `--avatar <url>` before Step 1. Already-hosted `https://...` URLs work as-is and skip this entirely. If no avatar is supplied at all, the identity-store default fires.

### Step 0 — Resolve URL (empty-args menu)

Strip flags (`--focus`, `--avatar`, `--voice`, `--live-url`, `--lipsync-provider`, `--no-captions`, `--preview`, `--skip-preview`, `--yes`) and `key=value` parameters from `$ARGUMENTS`. **If what remains contains no `https://...` URL** (or is empty / whitespace-only), print this menu **verbatim** as your full response, then **stop and wait for the user's next message**. Calling a tool here risks recording or explaining the wrong page. If `$ARGUMENTS` already carries a URL, skip this step silently and proceed to Step 1.

> **Which URL would you like me to walk through?** Works on any of:
>
> - **A GitHub repo** — e.g. `https://github.com/anthropics/claude-code` (activates repo-aware mode: README scan + live-demo detection)
> - **A product page / launch page** — e.g. `https://pika.art`
> - **A docs site** — e.g. `https://docs.anthropic.com`
> - **A blog post / article URL**
>
> Output: 1280×800 macOS Sonoma frame with a bottom-left avatar lipsync and element-targeted zoom on every mid-section beat. Default flow runs end-to-end with no confirmation gates — pass `--preview` if you want a 3-second lipsync sanity check first.
>
> Reply with the URL and I'll start.
>
> *Tip: you don't need to type `/pika:explainer` — just say things like "walk me through <url>", "make a demo video of <url>", or "explain this repo: <github-url>" and I'll fire this skill automatically.*

When the user replies with a URL, treat it as the resolved input and proceed to Step 1. Do not re-prompt.

### Step 1 — Parse input + detect mode

Required: `url` (must be `https://...`).
Optional: `--avatar <url>` (overrides identity-store default), `--voice <minimax-voice-id>`, `--focus "..."` (editorial guidance woven into vo_text), `--live-url <url>` (force-supply live demo URL — GitHub mode only), `--lipsync-provider <pika|kling>` (defaults to **`pika`** — parrot a2v, ~2-5 min wall-clock, slightly more dramatic head motion. Pass `kling` for tighter face-centered output at ~5-30 min wall-clock — Kling produces minimal-head-motion presenter shots but is the long-pole stage; reserve for high-stakes renders), `--no-captions` (skip the Step 11 caption burn — default is captions on), `--preview` (opt-in to the Step 5 preview gate — ~3s lipsync of "Hi, I'm your presenter" for testing new avatar/voice combos before the long-pole render; default is no preview). `--skip-preview` and `--yes` are accepted as no-ops for backward compatibility.

**Mode detection:**
- **GitHub mode** — URL host is `github.com` AND path matches `/{owner}/{repo}` (no further path segments past the repo root). Activates the repo-aware extras: README scan, live-demo detection, GitHub-specific selectors.
- **Generic-URL mode** — anything else (a product page, docs site, blog post, deeper GitHub path like `/blob/HEAD/path`). Skips the GitHub extras; uses generic CSS selectors and walks through the URL itself.

**Avatar resolution (silent — never ask the user):**
1. If `--avatar <url>` was passed, use it.
2. Else call `mcp__pika__identity_avatar_url`. If non-null, use it.
3. Else (fresh user, no identity avatar set yet) call `mcp__pika__generate_image` once with prompt `"professional presenter, friendly tech narrator, studio portrait, 1:1, natural lighting"` and use the returned URL. Do **not** ask the user "should I generate one?" — just generate silently.

**Voice resolution (silent — never ask the user):**
1. If `--voice <id>` was passed, use it.
2. Else call `mcp__pika__identity_voice_id`. If non-null, use it.
3. Else pick a casual MiniMax `speech-2.8-hd` preset matching the resolved avatar's apparent gender:
   - **Female-coded avatar** → `English_PlayfulGirl` (warm, casual, clearly female-voiced — verified)
   - **Male-coded avatar** → `English_Jovialman` (warm, casual male)
   - **Unclear / gender-neutral** → `English_Jovialman` (default)

   Determine gender from `mcp__pika__identity_persona_read` (look for a gender / pronouns field) when identity exists; otherwise infer from the resolved avatar image. Do **not** call `analyze_media` for this — it's not worth the extra ~30s round-trip. Do **not** ask the user.

   **Do NOT use `English_FriendlyPerson`** — despite being categorized under "female" in MiniMax's catalog, its display name is "Friendly Guy" and it reads as male in playback. `English_PlayfulGirl` is the canonical casual-female pick. Other verified-female alternates: `English_Upbeat_Woman`, `English_LovelyGirl`, `English_radiant_girl`.

The flow below is annotated per step: **GitHub-only**, **Generic-only**, or **Both** modes.

### Step 2 — Read source (no MCP call)

**Both modes:** use Claude's `WebFetch` on the input URL to pull the page's main content (h1, hero section, headings, primary copy).

**GitHub mode additions:** also fetch top-level file tree, (best-effort) `package.json` / `pyproject.toml`, and GitHub API repo metadata via `gh api repos/{owner}/{repo}` for `homepage`, `description`, `language`, `topics`. Detect a candidate `live_url` in this priority:

1. User-supplied `--live-url`.
2. **GitHub API `meta.homepage` field** — set when the maintainer configured the repo's homepage in GitHub settings (matches tarball `repo_analyzer.py:66-77`).
3. `package.json` `"homepage"` field.
4. First match in README of `https?://[^\s)\"'<>]+(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app|render\.com|herokuapp\.com|surge\.sh)[^\s)\"'<>]*`.
5. **Any other URL in README that the badge area / "Live Demo" / "Project Page" / "Demo" text points at.** The allowlist regex above misses arbitrary custom domains (e.g. `<project>-project-page.com`); when the README explicitly designates a project page, prefer that over the github.io fallback.
6. GitHub Pages convention `https://{owner}.github.io/{repo}` — but only if the deep tree contains a frontend signal (one of `index.html`, `App.tsx`, `App.jsx`, `App.vue`, `app.py`, `main.py`).

If no candidate resolves, the beat sheet skips beats 6–7.

**Generic-URL mode:** the input URL itself is the only URL the beats walk through — no `live_url` inference, no extra metadata fetches. Skip Step 2.5 and Step 3.0; jump straight to Step 3.

### Step 2.5 — Verify `live_url` reachability (GitHub mode only, no MCP call)

If a candidate `live_url` was selected, verify it serves real content **before** authoring beats 6–7. Use `WebFetch` on the candidate and check the response:

- If the response status is 4xx / 5xx, **drop `live_url` to None** and skip beats 6–7. The github.io fallback in particular is reachable as a hostname but often returns 404 ("There isn't a GitHub Pages site here") for repos that haven't enabled Pages — recording that 404 page wastes ~12s of the explainer on wrong content.
- If the response renders the GitHub Pages "404 — There isn't a GitHub Pages site here." template (heuristic: response body contains `"There isn't a GitHub Pages site here"`), drop `live_url` and skip beats 6–7.
- Otherwise, keep `live_url` for beats 6–7.

This mirrors the original tarball's `requests.head(live_url, timeout=6, allow_redirects=True)` reachability gate.

### Step 2.6 — Generic-URL pre-flight (Generic-URL mode only, no MCP call)

Before authoring beats for a non-GitHub URL, WebFetch the input URL and inspect the response. This step prevents three common Generic-URL failure modes: (a) recording a captcha / bot-block page instead of content, (b) the cookie/consent banner eating the first ~3 seconds of video, (c) generic CSS selectors missing the page's actual hero / sections.

**A. Bot-block / captcha detection — abort if matched:**

If the response body contains any of:

- `"Verify you are human"` / `"verify you are not a robot"`
- `"captcha"` / `"CAPTCHA"` / `"reCAPTCHA"`
- `"403 Forbidden"` / `"Access Denied"`
- `"Just a moment"` + `cf-chl-bypass` (Cloudflare challenge)
- `"We're sorry, something went wrong"` (Amazon-style bot block)
- A `<title>` or h1 of just "Robot Check" / "Are you a robot?"

→ **ABORT** with a clear error to the user: "Generic-URL mode can't render this site — the page is showing a bot-detection / captcha challenge under headless Chrome. Try a different URL, or run a real-user version of the page first to verify it loads cleanly."

**B. Cookie / consent-banner detection — defuse with `extra_css` + optional click:**

Scan the response for these patterns (case-insensitive):

- IDs / classes starting with `onetrust-`, `truste-`, `cookie-banner`, `cookie-consent`, `gdpr-`, `consent-`, `cmp-`
- Buttons matching `(?i)accept (all )?cookies` / `(?i)agree.{0,10}cookies` / `(?i)i (accept|agree)`
- Apple-specific banner: id `ac-gdpr-banner` or class `as-globalfooter-curtain`
- Google consent: `[role="dialog"]` with text "Before you continue"

If detected, set `cookie_banner_present = true`. Defense in depth — the recording uses BOTH:

1. **CSS injection (`extra_css`)** in the `capture_website` call to hide common banners universally — even if the click below misses, the banner is visually gone.
2. **A `click` `timed_action`** at `at_s: 0.0` against the most likely dismissal selector (extracted from the WebFetch DOM, e.g. `#onetrust-accept-btn-handler`, `[aria-label*="Accept all" i]`, `button[id*="accept"]`).

The `extra_css` payload (use this verbatim — covers ~80% of consent platforms):

```
#onetrust-banner-sdk, #onetrust-pc-sdk, #onetrust-consent-sdk { display: none !important; }
#truste-consent-track, #truste-consent-content, .truste_box_overlay { display: none !important; }
[id*="gdpr-cookie"], [id*="cookie-consent"], [id*="cookie-banner"] { display: none !important; }
[class*="cookie-banner"], [class*="cookie-consent"], [class*="consent-banner"] { display: none !important; }
[class*="CookieBanner"], [class*="CookieConsent"], [class*="ConsentBanner"] { display: none !important; }
#ac-gdpr-banner, .as-globalfooter-curtain { display: none !important; }  /* Apple */
[role="dialog"][aria-label*="cookie" i], [role="dialog"][aria-label*="consent" i] { display: none !important; }
.cmp-container, .cmp-modal, .cmp-banner { display: none !important; }
```

**C. Real-DOM element identification — emit concrete selectors:**

Generic CSS selectors (`h1`, `[class*="hero"]`, `section h2`) work on semantic / well-marked-up sites but miss obfuscated class names on big-name corporate sites (apple.com uses `tile-headline` / `as-headline-section-title`, not `hero-*`). For each beat, prefer the **actual DOM elements** observed in the WebFetch:

- Read the rendered HTML/markdown WebFetch returned. Note the page's actual primary `<h1>` text and class.
- Note the page's section structure (h2 headings + their parent containers).
- Note any prominent CTA / signup / pricing element.
- Emit `zoom_target.selector` using **the actual class or id observed**, falling back to semantic structure (`main > section:nth-of-type(N) h2`) when class names look auto-generated (Tailwind `_1a2b3c`, CSS modules `module__hero___xYz`).

**D. SPA / lazy-render detection — bump initial wait:**

If the WebFetch response has fewer than 3 visible headings / minimal text content, the page may be SPA-rendered post-`domcontentloaded`. Emit a longer initial `wait` action (`{type: "wait", at_s: 0.0, ms: 2500}`) before any beat fires, instead of the default 600ms settle.

**E. `--focus` is honored when supplied (do not solicit):**

Without `--focus`, select beats from generic structure cues — proceed silently with a confident first attempt. Do **not** ask the user "what should I focus on?" before firing; users iterate by re-running with `--focus "the X feature"` if the first pass misses the angle they wanted. With `--focus` supplied, anchor beat selection on the phrase: uses concrete page sections that match it, ignores irrelevant marketing chrome.

### Step 3.0 — Required README section scan (GitHub mode only, no MCP call)

Before authoring the beat sheet, **scan the README** (case-insensitive, full-text) for any of these section names. If a match is found, you **must** add a dedicated beat for that section in Step 3, replacing one of the generic beats 4–5 if necessary:

| README contains... | Required beat |
|---|---|
| `how it works` | scroll_to that heading; zoom `article h2:has(#user-content-how-it-works)` |
| `audio layer` / `audio timeline` | scroll_to the audio-layer diagram; zoom on the rendered figure or its surrounding heading |
| `claude code` / `mcp integration` | scroll_to that section; zoom `article pre` or `.highlight` (terminal screenshot / code block) |
| `architecture` / `system design` | scroll_to that section; zoom `article h2:has(#user-content-architecture)` |
| `features` (when prominent at top) | scroll_to that heading; zoom `article h2:has(#user-content-features)` |
| `getting started` / `quick start` / `installation` | scroll_to that heading; zoom `article h2:has(#user-content-installation)` (or the matching slug) — falls back to `article pre` if you want the install code block instead |
| `usage` / `examples` | scroll_to that heading; zoom `article h2:has(#user-content-usage)` (or the matching slug) — or the first code block under it |

**GitHub heading slug rule:** lowercase, spaces → dashes, strip non-`[a-z0-9-]` characters. So "How it works" → `#user-content-how-it-works`, "Quick Start" → `#user-content-quick-start`. GitHub injects the `<a id="user-content-{slug}">` anchor inside each rendered `<hN>`, so `hN:has(#user-content-{slug})` reliably grabs the heading element across any GitHub README.

**Selector contract:** `bbox_selector` needs to be vanilla CSS that resolves via `document.querySelector` (`capture_website` runs the post-action smooth-scroll JS via `page.evaluate`, which uses the browser's native selector engine). Avoid Playwright extensions like `:has-text("...")`, `text=...`, or `:visible`: those resolve in Playwright's `page.query_selector` (so the bbox capture finds the element) but silently fail in the smooth-scroll's `document.querySelector` (so the page never scrolls to the target, and `bbox.y` ends up at document-Y instead of `top - 60 px`, which trips Step 8b's `bbox.y > recording_viewport.h` degenerate filter and falls back to default-position zoom). CSS Level 4 `:has(...)` is vanilla and supported in modern Chromium.

These sections are the highest-information visuals in most explainer-worthy repos. Missing them produces a generic walkthrough; including them gives the explainer a concrete "show, don't tell" beat. The original tarball SKILL.md flagged the first four with `SPECIAL` rules in the Gemini prompt; this Step 3.0 promotes them from incidental guidance to a hard requirement and adds three more high-signal headings common in OSS READMEs.

### Step 3 — Author beat sheet (main thread, no MCP call)

Write a JSON array of 8–10 beats, **with a hard total duration of 65–80 seconds and a hard total word count of 165–200 words** (assuming a speaking rate of 2.5 words/sec). Each beat:

```jsonc
{
  "t_start": 0.0,
  "t_end": 7.5,
  "action": { "type": "navigate" | "scroll_to" | "hover", "url": "...", "selector": "..." },
  "zoom_target": { "selector": "...", "description": "..." },
  "vo_text": "exact words to speak — 1 to 2 conversational sentences"
}
```

**Hard constraints (validate before emitting the beat sheet — reject the draft if any fails):**
1. Every beat needs all five fields: `t_start`, `t_end`, `action` (with `type` and `url`), `zoom_target` (with `selector`), `vo_text`. Missing fields ⇒ reject and re-author. (Mirrors tarball's `github_explainer.py:183-190` validation pass.)
2. `t_start` of beat 0 = 0.0; `t_end[i] == t_start[i+1]` (continuity).
3. `len(vo_text.split()) / 2.5` ≈ `t_end - t_start` per beat. Aim for ±10% of this estimate; if your draft is denser than 2.5 wps, tighten the `vo_text` until it fits.
4. **Total `t_end` of last beat ≤ 80 seconds.** (Reference output is 86.5s including intro; lipsync audio is ~83s. Kling avatar/image2video stalls reliably past ~90s of audio under current load — going over 80s risks a 20-min Kling timeout.)
5. **Total spoken word count between 165 and 200 words.**
6. Every beat's `zoom_target.selector` needs to be a valid CSS selector for the page that beat lands on. **GitHub mode prefers** GitHub-specific selectors: `h1.f1`, `#readme`, `article h2`, `.blob-code-inner`, `.highlight`, `.octicon-star`, `nav`. **Generic-URL mode prefers** robust generic selectors: `h1`, `[role="main"]`, `main`, `header`, `nav`, `.hero`, `.feature`, `section h2`, `[class*="cta"]`, `[class*="hero"]`, `button`, `a[href]`. **Selectors need to resolve on the rendered page after the beat's action settles** — verify against the DOM you can see via WebFetch before emitting.
7. `vo_text` is 1-2 conversational sentences. Dev voice. No stage directions. No markdown.
8. `action.url` is a valid `https://...` URL when `action.type == "navigate"`; required.

**Self-check before Step 4:** verify `total_words` is in `[165, 200]` AND `total_seconds` (= `beats[-1].t_end`) is in `[65, 80]`. If either misses bounds, re-author the beat sheet — do not proceed to TTS. (No need to "print" anywhere — this is an internal draft validation; just reject the draft and re-author until it passes.)

**Structural skeleton — GitHub mode (load-bearing for the visual contract — match origin, but Step 3.0 overrides if applicable):**

- **Beat 1:** `navigate` repo root, zoom `h1.f1` (repo title), hook sentence.
- **Beats 2–3:** `navigate` to specific source files (`https://github.com/{owner}/{repo}/blob/HEAD/<path>`), zoom `.blob-code-inner` or `.highlight`. Pick files that match the narration's claim — don't navigate to a file you won't talk about.
- **Beats 4–5:** `scroll_to` README sections, zoom `article h2` or `#readme`. **If Step 3.0 surfaced required sections, replace these slots with the required ones.**
- **Beats 6–7 (only if `live_url` survived Step 2.5):** `navigate` to `live_url`, zoom `nav` / `h1` / `.hero` / `main` / `button` / `.feature`.
- **Beat 8:** back to repo root, zoom `.octicon-star`, outro.

**Structural skeleton — Generic-URL mode:**

- **Beat 1:** `navigate` to the input URL, zoom `h1` or `[class*="hero"] h1` (the page's primary headline), hook sentence.
- **Beats 2–3:** `scroll_to` the page's hero / value-prop / first feature section. Zoom `.hero`, `[class*="hero"]`, `[class*="feature"]`, or `section:nth-of-type(1) h2`. Pick visible elements the narration references.
- **Beats 4–5:** `scroll_to` deeper sections — feature lists, screenshots, pricing, social proof. Zoom `section h2`, `[class*="feature"] img`, `[class*="testimonial"]`, `[class*="pricing"]`, or any prominent semantic element on the page.
- **Beats 6–7:** `scroll_to` CTA / signup / demo embed. Zoom `[class*="cta"]`, `button`, `a[class*="button"]`, or `[id*="signup"]`. (No live-demo navigation in generic mode — the input URL IS the demo.)
- **Beat 8:** `scroll_to` footer / closing element, zoom `footer h2`, `footer`, or back to top with `h1`. Outro sentence.

If `--focus` is supplied, weave its angles into `vo_text` without mutating the structural skeleton. Prefer **CSS selectors over `text_content`** in `zoom_target.selector` — bbox capture is selector-only (see Known gaps).

### Step 4 — TTS

Call `mcp__pika__generate_speech` with `provider: "minimax-tts"`, `text: <full vo_text join>`, optional `voice_id`. Capture `result.audio_url` (the dispatcher returns audio under `audio_url`, not `url`) and `result.duration_seconds`. Voice defaults to identity-store injection in plugin mode.

**Stale-voice fallback detection (AGNT-231):** the dispatcher retries once with the default `Calm_Woman` voice on Minimax `status_code:2054` (voice id not found — typically a per-agent workspace pointer that Minimax auto-deleted after 7 days of inactivity). On retry success the response carries two extra fields beyond the documented schema (passthrough): `voice_id_requested` (the planted-but-stale id the worker tried first) and `fallback_reason: "invalid_minimax_voice_id"`. **If you see `fallback_reason == "invalid_minimax_voice_id"` in the response, surface a one-line note to the user along the lines of:** "your registered voice expired on Minimax (auto-GC'd after 7 days of inactivity); we used the system default. Re-clone via `clone_voice` if you want personalization back." The render does NOT fail — it just uses the default voice — so this is informational, not a retry trigger.

**Cookie-banner audio padding (Generic-URL mode with `cookie_banner_present == true` from Step 2.6 §B):** prepend MiniMax's pause marker `<#1.5#>` to the `text:` argument **before** calling `generate_speech`. MiniMax's `speech-2.8-hd` honors `<#N#>` as N-second silence; the returned `audio_url` and `duration_seconds` include the 1.5s lead-in natively. This aligns the audio with the screen recording's cookie-dismissal +1.5s offset applied in Step 4.5.

**Fallback** (only if smoke-test shows the marker is ignored on this voice): call `generate_speech` normally, then `mcp__pika__edit_audio_mix` to overlay the result onto a 1.5s silent base at offset 1.5s. **Then call `mcp__pika__analyze_media(url=<padded_audio_url>)` to probe the padded duration and rebind `duration_seconds = result.duration_seconds`** before Step 4.5 consumes it. `analyze_media` is the **single authoritative duration probe** — do not rely on `edit_audio_mix`'s return payload (its duration field is not contractually guaranteed).

### Step 4.5 — Audio length verification + beat-sheet rescale

Applied to `audio_duration_seconds` post Step 4 (which includes any cookie lead-in pad). End state: `beats[].t_start` / `t_end` are absolute wall-clock seconds matching the audio playback timeline. **All `beats[]` mutations happen here**; Steps 6 and 8 are read-only consumers.

**Gate 1 — Kling stall ceiling (provider cap, raw audio_duration_seconds):**
If `audio_duration_seconds > 90`, abort and re-author the beat sheet with a tighter word budget. Kling avatar/image2video stalls past ~90s.

**Gate 2 — Degenerate TTS (spoken-content length):**
Compute `narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)`. If `narration_duration < 30s`, retry Step 4 once (and recompute `narration_duration` from the retry's audio). If the retry also returns `narration_duration < 30s`, abort and investigate — likely failure modes: truncated MiniMax response, silent audio, vo_text not joined correctly.

**Gate 3 — Rescale:**
- `narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)`
- `scale = narration_duration / beats[-1].t_end`
- If `scale < 0.5` or `scale > 1.5`, abort and re-author. Structurally broken TTS (or wildly off word budget); rescaling won't save it.
- For each beat: `beat.t_start *= scale; beat.t_end *= scale`
- If `cookie_banner_present`: for each beat, `beat.t_start += 1.5; beat.t_end += 1.5`
- **Final clamp:** `beats[-1].t_end = audio_duration_seconds` (exact). Guarantees float equality of the invariant regardless of cookie mode or accumulated float drift.

**After Gate 3 passes**, emit a one-line operator log to surface the scale value for post-run diagnosis:

```
Rescaled beats by scale=X.XX (audio=Y.YYs, narration_duration=Z.ZZs, cookie_pad=W.Ws)
```

**Advisory (not a gate):** scale near 1.0 is ideal. `scale > 1.2` means audio is meaningfully slower than predicted — visuals feel "stretched" but stay in-sync. `scale < 0.85` means audio is faster — visuals feel "rushed" but in-sync. Both pass the gates; if the user reports "feels off-pace" rather than "out of sync," re-author with a tighter / looser word budget.

### Step 5 — Preview gate (opt-in via `--preview`)

**Skip Step 5 entirely by default.** Proceed directly to Step 6 unless the user explicitly passed `--preview` — do **not** generate a preview, do **not** ask for confirmation. This matches industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return; account credit balance + provider failover are the canonical guardrails.

`--skip-preview` and `--yes` are accepted as no-ops for backward compatibility — they were the old opt-out flags.

If `--preview` was supplied:

1. `mcp__pika__generate_speech` with `text: "Hi, I'm your presenter. Let's explore this repo together."` → `preview_audio_url`.
2. `mcp__pika__generate_lipsync` with `provider: <resolved_lipsync_provider>` (defaults to `pika`; honor `--lipsync-provider kling` if supplied), `image: <avatar>`, `audio: preview_audio_url` → `preview_lipsync_url` (bare lipsync, ~3s). Use the same provider here as Step 9 will use for the full audio — the preview's job is to confirm the avatar+voice+provider combo before the long-pole render.
3. Present to the user verbatim:

   > Preview ready: `<preview_lipsync_url>`
   > This confirms the avatar + voice combo. The full render is a long pole (~5–30 min Kling lipsync on the full audio).
   > Reply `yes` to proceed, or anything else to cancel.

4. Match `^(yes|go|proceed|confirm|y)$` (case-insensitive). Anything else → STOP, no further MCP calls.

### Step 6 — Build `timed_actions` and record

Translate the beat sheet into `capture_website` `timed_actions`. **One `timed_action` per beat** — set `bbox_selector` to the beat's `zoom_target.selector` and `capture_website` captures the post-action bbox of that element internally (legacy 600 ms settle → smooth-scroll-to-`top - 60 px` → 1300 ms post-anim → measure, all server-side).

For each beat in order, emit one entry:

- **`navigate` beats**: `{type: "navigate", at_s: <t_start>, url: <action.url>, bbox_selector: <zoom_target.selector>}`. The worker navigates, waits to absolute `at_s + 0.6 s`, scrolls `bbox_selector` into view, and measures the bbox — all without the caller scheduling a follow-up step.
- **`scroll_to` / `hover` beats**: `{type: "scroll", at_s: <t_start>, selector: <action.selector or zoom_target.selector>, bbox_selector: <zoom_target.selector>}`. The action's own `selector` drives the page scroll; `bbox_selector` drives the bbox measurement (it can be the same selector or different — usually the same). (`capture_website` has no `hover`; scroll-into-view is the analog.)

**Do NOT prepend the eight-step intro scroll-through** that the tarball ran. The lipsync audio is timed from `t=0` of the beat sheet; a prepended intro shifts the screen recording forward by ~3 s while leaving the audio un-shifted, causing audio/video desync. The capture_website recording begins at `t=0` with beat 0's URL already loaded — that's the orientation the tarball's intro scroll provided, minus the desync.

Call `mcp__pika__capture_website`:

- `url: <beat 0's action.url>`
- `timed_actions: <the N-element list built above>` (one entry per beat)
- `duration_s: ceil(audio_duration_seconds)` — `beats[].t_start` and `t_end` have already been rescaled to the TTS audio timeline by Step 4.5, so `duration_s` is simply the audio length. The old `max(...)` defense against TTS overrun is no longer needed.

**Generic-URL mode additions** (per Step 2.6 pre-flight):

- `extra_css: <the cookie-banner-hiding CSS payload from Step 2.6 §B>` — defensive: hides common consent platforms via `display: none !important;` so even if the optional click misses, the banner is invisible in the recording.
- **Prepend a `wait` action** `{type: "wait", at_s: 0.0, ms: 2500}` for SPA / lazy-render pages (per Step 2.6 §D); use 1500ms for "normal" pages. This gives time for hero images to lazy-load, fonts to swap, and scroll-triggered animations to be ready before the first beat fires.
- **If `cookie_banner_present` from Step 2.6 §B**, also prepend a `click` action `{type: "click", at_s: 0.5, selector: <detected dismissal selector from WebFetch DOM>}`. **The `beats[]` array has already been shifted by `+1.5s` in Step 4.5 to account for the cookie-dismissal lag, and the TTS audio has already been padded with 1.5s of silence (Step 4); no further shifting is required here. Beat 1's `timed_action.at_s` reads `beats[0].t_start` directly, which is 1.5 in cookie mode.**
- **No cookie banner action needed** if `cookie_banner_present == false`; just the prepended wait action.

Capture `video_url`, `recording_viewport`, `action_bboxes`. The result returns `recording_viewport: {w, h}` and `action_bboxes: [{idx, selector, found, bbox: {x,y,w,h}}]` alongside `video_url`.

**`action_bboxes[].idx` semantics:** the `idx` field is the position in the **input** `timed_actions` array.

- **GitHub mode**: with one timed_action per beat, `idx` maps 1:1 to beat index — Step 8 uses `entry.idx` directly as `beat_idx`.
- **Generic-URL mode**: the prepended `wait` (and optional cookie-dismissal `click`) shift the array by 1 or 2. Compute `beat_idx = entry.idx - prepend_count` where `prepend_count` is 1 (wait only) or 2 (wait + click). Skip entries where `beat_idx < 0` (those are the prepended setup actions, not beats).

The `selector` field on each entry reports `bbox_selector` (i.e. `zoom_target.selector`), not the action's own `selector`.

### Step 7 — Browser chrome

`mcp__pika__edit_browser_frame`:

- `video_url: <Step 6 video_url>`
- `url: (live_url if GitHub-mode and survived Step 2.5 else input_url, truncated to 65 chars)`
- `tab_title: <30-char title>` — GitHub mode: `(meta.description or repo_name or "")[:30]`. Generic-URL mode: the page's `<title>` (from WebFetch in Step 2) or the URL's hostname, truncated to 30 chars. Guard against `None`/empty.

Returns `framed_url` (1280×800 Sonoma + chrome).

### Step 8 — Build `zoom_keyframes` and apply

Constants:

- `INTRO_BEATS = 2` — gates by **beat-sheet index**. Skips zoom on beat indices 0 and 1 ("Beat 1" and "Beat 2" in the structural skeleton above).
- `HOLD_GAP = 0.6` — seconds of 1.0× before each zoom-in and after each zoom-out.
- `MIN_BEAT_DUR = 1.5` — beats shorter than this are skipped (no room for a meaningful zoom).
- `SCALE = 1.35` (precise element-targeted zoom).
- `FALLBACK_SCALE = 1.25` (default-position fallback when no usable bbox).
- `FALLBACK_RAMP = 0.4`.

**Note:** `beats[].t_start` / `t_end` were rescaled (and cookie-shifted if applicable) to the audio timeline by Step 4.5. HOLD_GAP (0.6s), MIN_BEAT_DUR (1.5s), and the 1.0s interior-interval check all operate on those final values — they are real visual seconds on the rendered video.

`edit_browser_frame`'s inner-content offsets: `CONTENT_X=56, CONTENT_Y=108, CONTENT_W=1168, CONTENT_H=637` (verified against the worker's `edit_browser_frame/main.py`).

Coord transform (recording px → framed px):

```
cx_framed = 56  + (bbox.x + bbox.w/2) * (1168 / recording_viewport.w)
cy_framed = 108 + (bbox.y + bbox.h/2) * (637  / recording_viewport.h)
```

**Build the zoom list with a per-beat default + bbox override pattern.** The legacy rig followed an "every non-intro beat gets a zoom — bbox-derived if available, default-position otherwise" rule. Reproduce that here:

**Step 8a — Pre-fill default-position keyframes for every non-intro, long-enough beat.**

Constants for the default position:
- `DEFAULT_CX = 56 + 1168 // 2` (screen center of the framed canvas)
- `DEFAULT_CY = 108 + 637 // 3` (upper-third of the content area, where most GitHub UI prominence lives)

Walk the beat sheet from index `INTRO_BEATS` (= 2) to the end. For each beat:

- If `t_end - t_start < MIN_BEAT_DUR` (1.5s), skip — too short for a meaningful zoom.
- Compute the keyframe's interior interval as `[t_start + HOLD_GAP, t_end - HOLD_GAP]`. If that interval is shorter than 1.0s, skip.
- Otherwise pre-fill that beat's slot in a per-beat map (call it `zoom_keyframes_by_beat[beat_idx]`) with `{cx: DEFAULT_CX, cy: DEFAULT_CY, scale: FALLBACK_SCALE (1.25), ramp_s: FALLBACK_RAMP (0.4)}` plus the trimmed `t_start`/`t_end`.

**Step 8b — Override with bbox-derived precise zoom where `action_bboxes` provided a usable measurement.**

For each entry in `action_bboxes`:

- `beat_idx = entry.idx` (since Step 6 emits one timed_action per beat). If `beat_idx < INTRO_BEATS`, skip.
- If `entry.found` is false, skip.
- If the beat isn't already in `zoom_keyframes_by_beat` (was filtered out in Step 8a by `MIN_BEAT_DUR`/`1.0s` rules), skip.
- **Filter degenerate bboxes:** skip if `bbox.y > recording_viewport.h` (offscreen capture — page didn't scroll the element into view in time) or `bbox.h > recording_viewport.h * 1.5` (full-page `<main>` element — yields a meaningless zoom center).
- Compute `cx_framed`/`cy_framed` from the bbox center using the recording-px → framed-px transform shown above. Override the beat's slot with `{cx: cx_framed, cy: cy_framed, scale: SCALE (1.35), ramp_s: min(0.5, (t_end - t_start) * 0.15)}`.

**Final list:** sort the values of `zoom_keyframes_by_beat` by `t_start` to produce the `zoom_keyframes` array.

This guarantees every non-intro, long-enough beat gets a zoom — precise when bbox capture worked, default-positioned otherwise. Avoids the "flat video for the whole runtime" failure mode.

If `len(zoom_keyframes) > 0`, call `mcp__pika__edit_animate_zoom` with `video_url: framed_url, zoom_keyframes`. Returns `zoomed_url`. Otherwise (no qualifying beats — should be rare given Step 3's 65-80s constraint) skip and use `framed_url` as `zoomed_url`.

### Step 9 — Lipsync the full audio

`mcp__pika__generate_lipsync`:

- `provider: <resolved_lipsync_provider>` — **default: `pika`** (parrot a2v). Honor `--lipsync-provider kling` if explicitly passed.
- `image: <avatar>`
- `audio: <Step 4 audio_url>`
- **kling-only knobs** (since `pika-mcp-server` BACK-339, 2026-05-10): when `provider == "kling"`, add `mode: "pro"` and `prompt: "talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"` for the polished-presenter feel. Both are silently ignored on `pika` (parrot has its own driver).

**Provider tradeoffs:**

| Provider | Wall-clock | Head motion | When to use |
|---|---|---|---|
| **`pika`** (default) | ~2–5 min | Slightly more dramatic, naturalistic | Default for most runs — fast iteration, watchable output, ~10× faster than kling |
| `kling` (opt-in) | ~5–30 min | Minimal, face-centered, presenter-style | High-stakes renders where the avatar must read like a polished presenter; tolerate the long pole |

Server-side-await covers the call inline; if the response shape is `{task_id, status: "queued"}`, poll `mcp__pika__task_status` in a tight loop (no sleep) until the status reaches a terminal state (`completed`, `failed`, or `cancelled`). On `completed`, capture `lipsync_url`. On `failed` / `cancelled`, fall back to the **other** provider (kling ↔ pika) per the failover note below.

**Failover:**
- If `pika` fails (rare — parrot a2v is robust at typical explainer audio lengths) → retry once with `provider: "kling"`.
- If `kling` stalls past the worker's 1200s ceiling (visible as repeated `processing` status with no completion) → fall back to `provider: "pika"`. Step 4.5's audio-length gate should catch the long-audio case before it gets here, but the failover handles the residual risk.

**Why pika is the default:**
- Speed — typical explainer wall-clock drops from ~10–15 min to ~5–7 min total because lipsync is the long pole.
- Quality is good enough — parrot a2v is naturalistic; the slight extra head motion reads as engaging rather than distracting in a 60-80s clip with avatar circle PiP.
- Kling-mode-pro polish is mostly invisible inside the 246-pixel circle anyway — face area is too small for the minimal-head-motion difference to register on most viewers.

For the canonical "polished presenter" feel of the original tarball reference output, pass `--lipsync-provider kling` explicitly.

### Step 10 — PiP composite

`mcp__pika__edit_pip`:

- `main_video_url: <zoomed_url>`
- `overlay_video_url: <lipsync_url>`
- `shape: "circle"`
- `size_px: 246`                    ← pixel-pinned 246px outer diameter (240 inner avatar + 3+3 stroke ring); matches tarball's `CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2`
- `stroke_width_px: 3`
- `stroke_color: "white"`
- `position_px: {x: 20, y: 476}`    ← `800 − 246 − 78` for dock clearance (matches tarball's `H − CIRCLE_OUT − 78`)

Pass `size_px`, not `size`; the fields are mutually exclusive. Returns `final_url`.

**Master-duration / audio-source contract** (matching tarball `github_explainer.py:418-419, 531-533, 578-582`): `edit_pip` uses `shortest=1` semantics by default, which means the composite's duration is the shorter of (zoomed screen recording) and (lipsync video). Step 6's `duration_s = ceil(audio_duration_seconds)` ensures the screen recording length matches the lipsync exactly (Step 4.5 rescaled beats to the audio timeline). The composite duration is set by the lipsync via `edit_pip`'s `shortest=1` semantics. Audio comes from the lipsync video's audio track (the lipsync embeds the original TTS audio); the standalone `audio_url` is not re-mixed. If the lipsync video is shorter than the screen recording (Kling sometimes trims trailing silence), the screen will get cut off at the lipsync end — accept this; the alternative (looping the screen) is worse for explainer content.

### Step 11 — Burn captions

Call `mcp__pika__add_captions(video_url=<final_url>, style="classic")`. `classic` renders a bottom subtitle bar — the right register for an explainer video (use `tiktok` / `hormozi` / `karaoke` only when the user explicitly asks for word-level highlight). The audio is extracted server-side from the PiP composite's lipsync track, so transcription matches the narration verbatim. Capture the result as `captioned_url`.

Skip this step only if the user passed `--no-captions` (parsed in Step 1) — the default is captions on. (Note: `/pika:podcast` does **not** burn captions — narration in an explainer is more transcription-friendly than fast two-host dialogue.)

### Step 12 — Return

Emit `captioned_url` (or `final_url` if Step 11 was skipped) on one line: `Done: <url>`.

## Load-bearing phrases

These anchors preserve the visual contract across page types:

| Phrase | Where | Why load-bearing |
|---|---|---|
| `vanilla CSS that resolves via document.querySelector` | Selector contract | Keeps scroll, bbox capture, and zoom targeting aligned inside `capture_website`. |
| `GitHub URLs activate repo-aware mode` | Mode detection | Prevents generic product-page beats from replacing README/code walkthrough beats. |
| `8-10 beats`, `65-80 seconds`, `165-200 words` | Beat-sheet authoring | Keeps narration, screen recording, lipsync, and captions within the reliable duration envelope. |
| `all beats[] mutations happen here` | Audio rescale step | Ensures later capture/zoom/composite steps consume one stable timeline. |
| `extra_css` cookie-banner hiding payload | Generic URL pre-flight | Reduces first-frame banner occlusion when a banner click misses. |

## Engine choice: Pika lipsync default, Kling opt-in

Default to Pika/parrot lipsync because it is faster and keeps most explainers in a short iteration loop. Use Kling only when the user explicitly requests `--lipsync-provider kling` or when a high-stakes render needs a more centered presenter look and can tolerate a much longer long-pole stage. Screen capture, browser frame, zoom, PiP, and captions remain deterministic edit/composite steps around that lipsync choice.

## Runtime expectations

Typical wall-clock is 5-10 minutes with Pika lipsync, or 10-30+ minutes with Kling lipsync:

| Step | Wall clock | Notes |
|---|---:|---|
| URL read + pre-flight | 10-60s | GitHub README scan or generic URL DOM/cookie checks |
| TTS + audio rescale | 30-90s | Beat timing is normalized after actual audio length |
| Screen recording | 60-180s | Depends on page load and navigation count |
| Browser frame + zooms | 1-3 min | Deterministic edit/composite stages |
| Lipsync | 2-5 min Pika / 5-30 min Kling | Kling is opt-in because it is the long pole |
| PiP + captions | 1-3 min | Captions skipped when `--no-captions` is set |

## Known gaps (carried as follow-up server-side work)

- ~~**Kling avatar `mode:"pro"` and `prompt` not exposed.**~~ **Resolved** by `pika-mcp-server` BACK-339 (PR #186, shipped 2026-05-10): `generate_lipsync` now wires both `mode` (e.g. `"pro"`) and `prompt` end-to-end for the kling provider. To enable polished-presenter mode here, pass `--lipsync-provider kling` and the Step 9 call should add `mode: "pro"` plus a prompt like `"talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"`. Real quality lever for reducing dramatic head motion in the lipsync — no longer a server-side gap.
- **No caller-controlled white-frame trim on the screen recording.** `capture_website` has internal trim heuristics but doesn't expose them to the caller. Visible as a brief white flash at the start of the explainer when the page is still loading. The 800ms `wait` action at `at_s: 0.0` mitigates this somewhat by giving the page time to paint, but doesn't trim already-recorded white frames. Worker enhancement.
- **No `networkidle` wait on per-beat navigation.** Tarball uses `page.goto(url, wait_until="networkidle", timeout=20000)` plus `wait_for_timeout(600)` after every navigate. `capture_website` settles to `domcontentloaded` plus the bbox-capture branch's 600 ms post-action settle (server-side, when `bbox_selector` is set), but SPA blob pages whose final render happens after `domcontentloaded` can still get bbox'd against unmounted code blocks. Worker enhancement: expose a `wait_until` knob on `timed_actions[].navigate`.
- **No per-step output-size verification gates.** Tarball's `verify()` helper at `github_explainer.py:35-39` checked TTS ≥ 50KB, preview ≥ 100KB, screen ≥ 200KB, lipsync ≥ 500KB, final ≥ 1MB after each step. The MCP path returns URLs only; verifying file size would require an extra `mcp__pika__analyze_media` call per step (~30s overhead each). Worth adding once user-side latency budget allows it. For now, a downstream-failure cascade (e.g. zero-byte TTS → silent lipsync → blank composite) only surfaces at Step 11.
- **`text_content` bbox capture not implemented.** `capture_website` v1 returns `action_bboxes` only for steps with a CSS `selector`. `text_content`-only steps produce no entry. Prefer CSS selectors in `zoom_target` for guaranteed zoom coverage.
- **Beat-sheet wording is non-deterministic.** Running the same input twice produces different vo_text and different zoom positions. Visual *kind* is the contract, not pixel-exact reproduction.
- **Generic-URL mode quality varies by site.** Modern indie / SaaS landing pages with semantic markup (`<h1>` + clear `<section>` + named class hooks) work well. Big-name corporate sites (apple.com, microsoft.com, amazon.com) hit several known limits: (a) **bot detection** — the page may serve a degraded version under headless Chrome, or a captcha; Step 2.6 §A aborts on these but the heuristics aren't exhaustive; (b) **obfuscated class names** — `tile-headline` instead of `hero-title` defeats generic selectors; Step 2.6 §C's WebFetch DOM scan helps but isn't perfect; (c) **scroll-triggered animations don't play** — IntersectionObserver-driven hero reveals fire on real user scrolls, not Playwright's `scrollIntoView`; the recorded frame may be a static placeholder; (d) **lazy-loaded images** — picture/source elements with `loading="lazy"` may not have resolved by the 600ms-or-2500ms settle window; the bbox lands on a transparent placeholder. Workarounds: prefer simpler / smaller marketing pages for launch demos, always pass `--focus "the X feature"` to anchor beat selection, accept that big-name sites need a follow-up server PR (cookie-banner click retry + `wait_until=networkidle` + animation-trigger via `IntersectionObserver` polyfill).
- **Cookie-banner click is single-attempt.** Step 2.6 §B emits one `click` against the dismissal selector extracted from the WebFetch DOM. If the WebFetch's HTML doesn't include the banner (rendered post-JS) or the selector is wrong, the click silently misses — the `extra_css` payload is the load-bearing defense. Worker enhancement: support a list of fallback selectors per `click` action so the worker tries each in order.
- **Step 8b ↔ Step 6 `idx`-mapping mismatch in Generic-URL mode.** Step 6 maps `beat_idx = entry.idx - prepend_count`, but Step 8b uses `beat_idx = entry.idx` naively. Pre-existing bug independent of rescaling.

## Auth

If any call returns 401: the user's OAuth token has expired or hasn't been issued. The next authenticated MCP call triggers OAuth automatically (browser opens for `@pika.art` Google login). For non-interactive environments, set `MCP_AUTH_TOKEN`.

## Examples

GitHub-mode (repo-aware: README scan + live-demo detection):

- `/pika:explainer https://github.com/leigest519/OpenGame`
- `/pika:explainer https://github.com/anthropics/claude-cookbooks --focus "Claude Code MCP integration"`
- `/pika:explainer https://github.com/openai/whisper --preview` (opt-in to the preview gate when testing a new avatar)

Generic-URL mode (any non-GitHub URL — drives through the page directly):

- `/pika:explainer https://pika.art`
- `/pika:explainer https://vercel.com --focus "the deployment workflow"`
- `/pika:explainer https://docs.anthropic.com/en/docs/claude-code/plugins`
- `/pika:explainer https://your-product-page.com --avatar https://cdn.example.com/me.png --preview`

Source

Creator's repository · pika-labs/pika-plugins

View on GitHub

Security

Security checks in progress
Results will appear here once audits complete
What this skill can do
Reads your filesConnects to the internetRuns code on your machine
Checked by 3 independent security firms
Does it try to trick the AI?Not yet checkedPending · Gen Agent Trust Hub
Does it sneak in hidden code?Not yet checkedPending · Socket
Does it have known bugs?Not yet checkedPending · Snyk