Editorial Standard — audioGuided session (narrated audio experience)

  • Directus reference: https://cmsdocs.naluma.space/sessions/audio-guided
  • Manifest entry: naluma-directus/authoring-docs/reference-manifest.json. audioGuided appears as the session_id enum value in the session_card block (domain coach-content); session-level surrounding fields (title, slug, duration, cover image) are documented there.
  • Schema: Pass-through; no session-content schema. The audioGuided template type carries no content JSON payload — the session is delivered entirely via audio assets managed in content.audio_assets. There is no naluma-directus/schemas/session-content/audioGuided.schema.json. Editorial governance applies to the narration script and transcript text, the why_this_works field, and citation fields authored in Directus.

Purpose

An audioGuided session is a voiced, coach-narrated experience — PMR, body scan, guided imagery, mindful acceptance, Leaves on a Stream, and related techniques. It is the deepest practice unit in the app: the user closes their eyes, hands over attention to the voice, and completes a structured protocol. The session's job is to facilitate a real shift in the user's relationship to the tinnitus sound or to their body's stress response. A session that leaves the user unchanged has failed, regardless of how pleasant the delivery was.

audioGuided sessions cover the programme's three core practice categories: relaxation training (PMR, autogenic training, diaphragmatic breathing, body scan), acceptance and defusion (Leaves on a Stream, cognitive defusion, RAIN, urge surfing), and specialised protocols (masking reduction, sleep-onset, morning activation, post-spike recovery). Staging in the programme matters: acceptance-based practices contraindicated in acute onset (Choiceless Awareness, Mindful Listening at full volume) require an in-app staging guard.

Voice register

Default register varies by technique and programme week:

  • Relaxation techniques (PMR, body scan, autogenic training): early habituation register -- companion-forward, still, warm. The voice guides without performing; it does not ASMR-whisper or over-soften. Delivery language is direct and concrete ("tense the muscles in your right hand, now release").
  • Acceptance/defusion practices (Leaves on a Stream, RAIN, cognitive defusion): habituation with deliberate pauses. The voice creates space for the user's own observation rather than filling every second. Less narration, more guided noticing.
  • Spike/crisis protocols (physiological sigh, post-spike sequence, crisis breathing): spike/acute register. Sharper, more active, maximally directive. No warmth-before-orientation. The user in a spike does not have bandwidth for a gentle preamble.
  • Sleep-onset variants: lower-pitched, quieter, pacing elongated (5-8 sec inter-phrase pauses per the IP guide). No re-arousal instruction. The voice trails the user toward sleep, not into wakefulness.

In all variants: the knowledgeable guide, not the meditation-app breathiness. State the technique, its mechanism, and the instruction. Do not narrate the emotional experience the user should be having.

Narration script and transcript

Two distinct artifacts — do not conflate them:

  • narration: (authored, session.md body) — the script the editor writes, a list of {text, pause_after_ms} segments. This is the pipeline input.
  • transcript ({ts_ms, text}) — the timed transcript the pipeline emits to audio/sessions/<slug>/<locale>.transcript.json. Never hand-authored; publish writes it to content.sessions_translations.transcript. It mirrors the narration text faithfully.
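
To make the two shapes concrete, here is a minimal Python sketch. The field names text, pause_after_ms and ts_ms come from the definitions above; the class names, the mirrors() helper and the sample timestamps are hypothetical illustrations, not the pipeline's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NarrationSegment:
    """One authored script segment (pipeline input)."""
    text: str
    pause_after_ms: int


@dataclass(frozen=True)
class TranscriptEntry:
    """One timed transcript entry (pipeline output, never hand-authored)."""
    ts_ms: int
    text: str


def mirrors(narration, transcript):
    """Return True if the transcript repeats the narration text, in order.

    Assumption: one transcript entry per narration segment. The real
    pipeline may segment differently.
    """
    if len(narration) != len(transcript):
        return False
    return all(n.text == t.text for n, t in zip(narration, transcript))


narration = [
    NarrationSegment("Tense the muscles in your right hand.", 4000),
    NarrationSegment("Now release.", 4000),
]
# Timestamps below are made up for illustration only.
transcript = [
    TranscriptEntry(0, "Tense the muscles in your right hand."),
    TranscriptEntry(6500, "Now release."),
]
```

The point of keeping two types is that an editor only ever touches NarrationSegment values; anything carrying ts_ms is generated.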

The body also carries register — voice-delivery tuning consumed by the pipeline (config.REGISTER_SETTINGS = breathing | narrative | exercise), distinct from the frontmatter session_subtype (library/UI classification). Both are currently authored; reconciling the overlap is tracked in #33.

pause_after_ms conventions

The authored pause_after_ms is the source of truth: what you write is what ships (the pipeline does not silently rescale it). A global --pause-scale fine-tuning knob exists on both naluma-audio render and voicelab, but it defaults to 1.0 (author's pauses as-is) and is only for one-off experiments, not a substitute for authoring the right values.
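
As a rough illustration of what a scale of 1.0 means, the sketch below assumes the knob multiplies each authored pause. That multiplicative behaviour, the rounding, and the function name scaled_pause_ms are assumptions made for the example; the text above only states that the default of 1.0 leaves the author's pauses unchanged.

```python
def scaled_pause_ms(pause_after_ms, pause_scale=1.0):
    """Hypothetical helper: apply a global scale to one authored pause.

    Assumption: the scale is a simple multiplier. With the default of
    1.0 the authored value passes through unchanged.
    """
    if pause_scale <= 0:
        raise ValueError("pause_scale must be positive")
    return round(pause_after_ms * pause_scale)


authored = [1500, 4000, 2000]
unchanged = [scaled_pause_ms(p) for p in authored]
experiment = [scaled_pause_ms(p, 0.5) for p in authored]
```

If an experiment with a non-default scale sounds better, the fix is to edit the authored values, not to ship with the scale applied.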

These targets reflect the 2026-06-12 voice-lab round (shorter, tighter pacing tested better):

| Context | pause_after_ms | Notes |
|---|---|---|
| Narrative (listening / psychoeducation) | 1500–2000 | Tighter listening pace. A deliberate beat after a key idea can go to ~2500. |
| Exercise (guided practice) | size to the action | Default toward ~half the old 3000–4000 (so ~1500–2000), but never shorter than the action needs; the silence is the practice. A full breath cycle ≈ 4000–6000; a body-scan region ≈ 3000–4000; a brief "notice this" ≈ 1500–2000. Do not reflexively halve a pause the user is meant to do something in. |
| Sleep-onset variants | 5000–8000 | Intentionally elongated (per the IP guide); do not shorten. |
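
The targets above could be turned into an authoring-time lint along these lines. This is only a sketch: the context keys, the dictionary and check_pause() are hypothetical names, the numeric ranges are copied from the table, and the narrative upper bound is set to 2500 to allow the deliberate beat mentioned in the notes. The targets are approximate, so a result is a prompt for editorial judgment, not a hard failure.

```python
# Ranges in milliseconds, taken from the pause_after_ms targets above.
# The keys are illustrative labels, not pipeline identifiers.
PAUSE_TARGETS_MS = {
    "narrative": (1500, 2500),            # 1500-2000 typical, ~2500 for a key idea
    "exercise_brief_notice": (1500, 2000),
    "exercise_body_scan_region": (3000, 4000),
    "exercise_breath_cycle": (4000, 6000),
    "sleep_onset": (5000, 8000),
}


def check_pause(context, pause_after_ms):
    """Return a warning string, or None if the pause is within target."""
    low, high = PAUSE_TARGETS_MS[context]
    if pause_after_ms < low:
        return f"{context}: {pause_after_ms} ms is shorter than the {low} ms target"
    if pause_after_ms > high:
        return f"{context}: {pause_after_ms} ms is longer than the {high} ms target"
    return None
```

For example, check_pause("exercise_breath_cycle", 2000) returns a warning, which catches the case of a reflexively halved pause that the user is meant to breathe through.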

(Breathing sessions are different: their loop phase-cue pauses are generated from content.pattern, not authored, and the prep pause before the rounds is a fixed contract — see session-breathing.md.)

Classification: session_subtype (narrative | exercise)

Every audioGuided session carries a required session_subtype (DB CHECK-enforced; the frontmatter field maps to content.sessions.session_subtype):

  • exercise — guided practice the user does: PMR, body scan, autogenic training, breathing protocols, Leaves on a Stream, RAIN. Instructional pacing with inter-phrase pauses. This is the default for the techniques described above.
  • narrative — long-form narrated psychoeducation the user listens to (audiobook pacing) — the audio sibling of the text insight template. Use only when the session is explanatory listening, not guided practice.

The value drives one app wording change and nothing else: after a narrative session the learn-more affordance reads "Read more" (psychoeducation — more reading is the natural next step); after an exercise session it reads "Why this works" (the user just practised — the mechanism explanation is the natural follow-up). Non-audioGuided template types must not carry a session_subtype.
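
The rule above reduces to a two-entry lookup plus one constraint. The sketch below is an assumption-laden illustration: the function and constant names are hypothetical, "insight" is used only as an example of a non-audioGuided template type, and the real enforcement is the DB CHECK constraint, not application code like this.

```python
LEARN_MORE_LABELS = {
    "narrative": "Read more",
    "exercise": "Why this works",
}


def validate_subtype(template_type, session_subtype):
    """Raise ValueError if the subtype rule is violated."""
    if template_type == "audioGuided":
        if session_subtype not in LEARN_MORE_LABELS:
            raise ValueError("audioGuided requires narrative or exercise")
    elif session_subtype is not None:
        raise ValueError("only audioGuided sessions carry a session_subtype")


def learn_more_label(session_subtype):
    """Return the learn-more wording shown after an audioGuided session."""
    return LEARN_MORE_LABELS[session_subtype]
```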

Evidence / IP

Script development and IP sourcing rules are governed by docs/session-audio-protocols-ip-guide.md. Key grounding:

  • PMR: Edmund Jacobson (public domain, d. 1983); VA Whole Health Library scripts (US government public domain, fully usable). Clinical adaptation: Bernstein and Borkovec (1973). Tinnitus adaptation: avoid instructions drawing attention to auditory sensation during face/neck sequence.
  • Diaphragmatic breathing / box breathing / extended exhale: public domain. Lehrer and Gevirtz coherent breathing (PMC7578229). Frame extended exhale as interrupting the tinnitus-to-anxiety feedback loop.
  • 4-7-8 breathing: technique unprotectable; Weil's text and brand protected. Write own script; attribute to pranayama origin. Reference: PMC9277512.
  • Body scan: technique ancient and unprotectable; MBSR trademarked -- do NOT use "MBSR" in product. VA public domain scripts usable. Tinnitus adaptation: when attention reaches ears/head, normalise tinnitus as one sensation among many.
  • Leaves on a Stream / cognitive defusion / ACT exercises: ACBS treats these as freely shareable clinical material. Zetterqvist ACT-for-tinnitus thesis (open access). Tinnitus framing: place the sound on a leaf; watch it drift; you do not need to follow it.
  • RAIN technique: structure (Recognize, Allow, Investigate, Nurture) not proprietary; Tara Brach's exact text is copyrighted -- write own narration.
  • Physiological sigh: Balban et al. 2023 (Cell Reports Medicine, PMC9873947). Frame as fastest acute sympathetic downregulation at spike.
  • Masking reduction: TRT protocol documented in PMC8632517 (open access); VA PTM Clinical Handbook (public domain).

Any clinical claim in the why_this_works field must be referenced to a study or programme name; the reference does not go in the narration itself. The narration is authorial voice, not a literature review. Do not reproduce verbatim any copyrighted script text (Kabat-Zinn, Weil, Therapist Aid). Primary safe source: VA Whole Health Library.

Length / reading level

  • Narration script length: determined by the variant. Express/crisis variants: 5-7 min narration. Standard clinical variants: 12-20 min. Sleep-focused variants: 20-25 min. Intro audio asset should be proportional -- no 3-minute preamble for a 5-minute session.
  • On-screen description (shown on the session screen): author a single sentence, ≤120 characters (hard-capped by validate.py). State what the session does. Authored as a top-level body description: field; publish writes it into content.description.
  • why_this_works field (shown post-session in Directus): 60-100 words. Mechanism-based. Names the technique, its clinical grounding, and what the user just practised. Not a review; not reassurance.
  • Reading level of on-screen text: Grade 8 or below. The intro card is read before the session begins and must be clear at a glance.
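
The numeric limits in this list can be checked mechanically. The sketch below is not the contents of validate.py, which is not shown here; all names are hypothetical, the single-sentence test is a crude punctuation heuristic, and word counting is a plain whitespace split.

```python
import re

DESCRIPTION_MAX_CHARS = 120
WHY_THIS_WORKS_WORDS = (60, 100)
NARRATION_MINUTES = {"express": (5, 7), "standard": (12, 20), "sleep": (20, 25)}


def description_problems(description):
    """Check the on-screen description: one sentence, at most 120 characters."""
    text = description.strip()
    problems = []
    if len(text) > DESCRIPTION_MAX_CHARS:
        problems.append(f"{len(text)} characters, limit is {DESCRIPTION_MAX_CHARS}")
    # Heuristic: sentence-ending punctuation followed by more text.
    if re.search(r"[.!?]\s+\S", text):
        problems.append("looks like more than one sentence")
    return problems


def why_this_works_problems(text):
    """Check the why_this_works word count against the 60-100 target."""
    low, high = WHY_THIS_WORKS_WORDS
    count = len(text.split())
    if count < low or count > high:
        return [f"{count} words, target is {low}-{high}"]
    return []


def narration_minutes_ok(variant, minutes):
    """Check narration length against the range for its variant."""
    low, high = NARRATION_MINUTES[variant]
    return low <= minutes <= high
```

Reading level and the "mechanism-based" requirement are not covered; those stay with the human reviewer.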

Editorial-required elements

  1. Technique fidelity before voice. The narration must follow the clinical structure of the technique (tense-release sequence, phase order, movement direction). Do not abbreviate the protocol to make the script shorter. If the session is too long, create a separate Express variant rather than cutting the standard form.

  2. No avoidance framing. Guided imagery safe-place scripts must frame the location as one where tinnitus recedes into the background -- NOT where it is absent. "A place where the sound is just one thing among many" not "a place where tinnitus disappears." Avoidance framing reinforces the wrong relationship.

  3. Mechanism statement in the intro or why_this_works. The user should know why this works before or after doing it, not just that it is good for them. "Extended-exhale breathing regulates the autonomic nervous system by lengthening the exhale relative to the inhale, which activates the parasympathetic response" is mechanism. "Breathing exercises can help with tinnitus" is not.

  4. Staging guard note where required. Acceptance-based practices that are contraindicated at acute stages (Choiceless Awareness, Mindful Listening, full Leaves on a Stream) must include a contraindicated_early flag in Directus -- do not omit it. Editorial review should flag any acceptance-technique session whose staging note is missing or has not been reviewed.

  5. Context-specific variant labelling. Each variant (sleep, morning, spike, express, commute-compatible) must be editorially distinct -- different pacing, different closing instruction, appropriate register. A "sleep version" that has the same close as the standard version has not been adapted.

  6. Obeys ai-patterns-en.md. Intro copy, why_this_works, and on-screen text must obey all of ai-patterns-en.md. Narration scripts are voice-delivered and may use natural spoken phrasing -- but still no em dashes, no wellness clichés, no toxic positivity, and no cure-adjacent claims.
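
Two of the requirements above (items 4 and 6) lend themselves to a mechanical pre-review check. The sketch below assumes a session is available as a plain dict; the "technique" key, the technique identifiers and both function names are hypothetical. Only the contraindicated_early flag and the no-em-dash rule come from the list itself.

```python
# Illustrative identifiers for the staged techniques named in item 4.
STAGED_TECHNIQUES = {
    "choiceless_awareness",
    "mindful_listening",
    "leaves_on_a_stream_full",
}

EM_DASH = "\u2014"


def staging_problems(session):
    """Flag a staged acceptance technique that lacks contraindicated_early."""
    staged = session.get("technique") in STAGED_TECHNIQUES
    if staged and not session.get("contraindicated_early"):
        return ["staged technique without contraindicated_early flag"]
    return []


def em_dash_segments(narration):
    """Return the indexes of narration segments that contain an em dash."""
    return [i for i, seg in enumerate(narration) if EM_DASH in seg["text"]]
```

A check like this only catches omissions. Whether a script actually avoids avoidance framing or wellness clichés still needs an editor to read it.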

Examples

Good -- why_this_works for a PMR session:

Progressive muscle relaxation works by introducing deliberate muscle tension, then releasing it. The contrast between tension and release trains the nervous system to recognise and lower its baseline arousal state. For tinnitus, reducing physiological arousal reduces the threat signal the brain attaches to the sound -- not the sound itself, but its perceived urgency.

Why this works: names the mechanism (contrast, arousal reduction, threat signal), links it concretely to tinnitus, ends with a distinction that gives the user a correct mental model (arousal, not volume).


Bad -- intro card text:

❌ This powerful 15-minute body scan will help you find your peace and manage your tinnitus by bringing healing attention to every part of your body. You've got this.

Why it fails: "powerful", "find your peace", "manage your tinnitus", "healing attention", and "you've got this" are all banned under ai-patterns-en.md (Naluma-voice additions and wellness clichés). The intro contains no mechanism and no specific claim. A user cannot tell from this text what the session actually does or why.

Pre-publish audio QA

Before a locale's audio is published, do a sampled listen-check on each rendered <locale>.m4a:

  • Right voice for the locale (see audio_pipeline/.../voices.toml).
  • Right language: spot-check that 2–3 segments are spoken in the target language.
  • No artifacts: no clipping, abrupt cuts, or wrong-pace pacer cues.

The pipeline asserts the transcript sidecar matches the script automatically (naluma_audio.verify); the human listen-check is the backstop for wrong-voice / wrong-language renders, which are not auto-detected (no ASR by design).
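
One way to choose the 2–3 segments for the language spot-check is to sample them from the emitted transcript, so the reviewer does not always listen to the opening lines. This is a hypothetical helper, not part of naluma_audio; it assumes the transcript is a list of {ts_ms, text} dicts as described earlier.

```python
import random


def pick_spot_check(transcript, k=3, seed=None):
    """Pick up to k transcript entries to listen to, in playback order.

    A seed makes the selection repeatable, for example when two
    reviewers should check the same segments.
    """
    rng = random.Random(seed)
    k = min(k, len(transcript))
    picked = rng.sample(range(len(transcript)), k)
    return [transcript[i] for i in sorted(picked)]


# Illustrative transcript with made-up timestamps.
sample_transcript = [
    {"ts_ms": i * 5000, "text": f"segment {i}"} for i in range(10)
]
to_check = pick_spot_check(sample_transcript, k=3, seed=7)
```

The ts_ms of each picked entry tells the reviewer where to seek in the rendered file.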