Localization Pilot — Learnings Log

Pilot: DE + ES localization of the 10 per-type test sessions.
Spec: docs/superpowers/specs/2026-06-11-localization-pilot-design.md
Plan: docs/superpowers/plans/2026-06-11-localization-pilot.md
Status: Translate + gate complete (20/20 files gate-clean). Audio renders, live Directus publishes, and the idempotency demo are deferred (await PR A merge + plugin reinstall; see "Cache/release" below).

This log is the deliverable that feeds the Phase-5 localization design.

Headline finding — the gate stack was English-only, and that was the real gap

The spec assumed an asymmetry of "DE gets the full humanize gate, ES skips it (no ai-patterns-es)." The truth was bigger: both the review and humanize skills hard-asserted locale == en and stopped on anything else, and the deterministic analyzer review.py only carried English pattern data. So neither DE nor ES could be gated — the German content (ai-patterns-de.md, voices) existed, but the tooling was never made locale-aware.

We turned that gap into the pilot's main tooling outcome (in naluma-ai-marketplace, PR feat/publish-non-en-locales):

  • ai-patterns-es.md authored (greenfield): European Spanish, informal tú, grounded in research (see below).
  • review.py made locale-aware — ported DE + ES banned-pattern data into the locale-keyed dicts (BANNED_VOCAB, ALLOWED, medical-hype, BANNED_CONSTRUCTIONS, OPENERS, cliches) verbatim from the docs, lowercased; +8 tests (DE+ES), 23 green.
  • humanize + review SKILL.md de-gated — supported locales are now en/de/es; they stop only on locales with no ai-patterns doc (fr/pt).

Net: the spec's DE-vs-ES asymmetry is gone — both DE and ES now have full, gated support. The pilot upgraded the tooling instead of working around it.
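
A minimal sketch of what "locale-keyed" means for the banned-vocabulary check. This is not the actual review.py code: the function name, the exception class, and the whole-word matching are assumptions made for illustration. The ES entries are words named in this log; the EN and DE entries are placeholders, not the real pattern data.

```python
import re

# Locale-keyed banned vocabulary (lowercased, as in the port described above).
# ES words come from this log; EN and DE entries are illustrative placeholders.
BANNED_VOCAB = {
    "en": {"delve"},          # placeholder
    "de": {"beispielwort"},   # placeholder, NOT taken from ai-patterns-de.md
    "es": {"destacar", "enfatizar", "realzar", "resonar"},
}


class UnsupportedLocale(ValueError):
    """Raised when a locale has no pattern data (hypothetical name)."""


def banned_hits(text: str, locale: str) -> list:
    """Return banned words found in text, in order of appearance."""
    if locale not in BANNED_VOCAB:
        # Stop only on locales without pattern data, e.g. fr/pt.
        raise UnsupportedLocale(f"no ai-patterns data for locale {locale!r}")
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w in BANNED_VOCAB[locale]]
```

The sketch matches exact word forms only, so inflected forms (destaca, destacan) would slip through; whether the real analyzer handles inflection is not stated here.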

Research grounding (per the user's mid-pilot request)

Are there multilingual AI-pattern sources? Yes:

  • Wikipedia "Signs of AI writing" is English-only — no DE/ES editions exist. (It's the basis of the /humanizer skill.)
  • Liang et al., "AI-Associated Lexical Shifts Across 34 Languages" (arXiv 2605.25358, 2026) — the strongest source; empirically ranks AI-overused words per language, incl. DE/ES. Key insight: overuse converges by concept, not literal word (LLMs over-favour "emphasising / importance / innovation" in each language's own vocabulary).
  • El Economista (2024) — concrete ES frequency data (crucial ~6,400× more frequent in AI text; fundamental, esencial, desafíos, resonar, dinámico…).

ai-patterns-es.md was re-grounded in these (added destacar, enfatizar, realzar, resonar, dinámico, fundamental, invaluable, elevar; openers En conclusión / En el contexto actual; fixed a self-inflicted subrayar → destacar bug since destacar is itself AI-flagged). The DE doc was left as-is (user decision) — note several study-flagged German words (Bedeutung, Notwendigkeit) are everyday nouns that would over-fire if hard-banned; only verb/filler tells are safe to add.

What the gate caught (it earns its keep)

All 20 translations ended gate-clean, but not all started clean:

  • Em dashes: reframe DE had 4 — caught and fixed (→ deterministic 100). German uses the en-dash Gedankenstrich (–, U+2013) legitimately; the analyzer counts only em dashes (—, U+2014), which is the right call here.
  • Description caps (≤120): DE/ES run longer than EN. how-habituation DE was 141→117, ES 130→111; three-minute-breathing-space ES 123→108. The cap is a real constraint for localization — literal translations routinely overshoot. (Enforced at publish by validate.py, not by review.py — the translator must check it.)
  • A session.<locale>.md authoring trap: a trailing --- after the body makes the YAML parse as two documents and the gate errors (expected a single document). Found in the first hand-authored DE file; baked into the subagent conventions thereafter.
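
The three checks above can be sketched as small pre-flight helpers a translator could run before the gate. All function names are hypothetical, and the sketch assumes the description cap is counted in characters; the real checks live in review.py and validate.py and may differ.

```python
EM_DASH = "\u2014"       # counted by the gate
EN_DASH = "\u2013"       # German Gedankenstrich, deliberately not counted
DESCRIPTION_CAP = 120    # cap cited above; assumed to be a character count


def count_em_dashes(text: str) -> int:
    """Count em dashes only; en dashes are left alone."""
    return text.count(EM_DASH)


def description_overshoot(description: str, cap: int = DESCRIPTION_CAP) -> int:
    """Number of characters over the cap (0 when within the cap)."""
    return max(0, len(description) - cap)


def ends_with_document_separator(text: str) -> bool:
    """Heuristic for the trailing '---' trap: is the last non-blank line '---'?"""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return bool(lines) and lines[-1] == "---"
```

The separator check is only a heuristic: a file with frontmatter and an empty body would also end in "---" and be flagged.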

Known limitation — readability is English-calibrated

review.py's fk_grade (Flesch-Kincaid via textstat) is English-tuned. DE/ES consistently exceed the 8.0 cap — ES especially (9–14) due to longer inflected words — even on clean, well-written prose. This pulled deterministic_score down (ES floors ~79–94) without indicating a real style problem. Phase-5: make readability locale-aware (textstat language config, or a per-locale cap, or drop FK for non-EN). The banned-pattern checks (the meaningful signal) ported cleanly and are reliable.
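
One of the Phase-5 options (a per-locale cap, with FK dropped where no cap is calibrated) could look roughly like this. It is a sketch under assumptions: the function name and the "penalty" return shape are invented, and only the EN cap of 8.0 comes from this log.

```python
from typing import Optional

# Only the EN cap (8.0) is cited in this log. DE/ES are intentionally absent:
# no calibrated value exists yet, so readability is skipped for them.
FK_CAP = {"en": 8.0}


def readability_penalty(fk_grade: float, locale: str) -> Optional[float]:
    """Grade points over the locale's cap, or None when FK is not scored."""
    cap = FK_CAP.get(locale)
    if cap is None:
        return None  # no cap for this locale: do not let FK affect the score
    return max(0.0, fk_grade - cap)
```

Returning None rather than 0.0 keeps "not scored" distinguishable from "scored and within the cap", which matters if deterministic_score is later averaged over its components.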

Translation conventions established (carry to FR/PT + future content)

  • Register: German informal du; European Spanish informal tú (parallels the brand's warm, non-clinical voice).
  • Plain-before-clinical: gloss glossary terms on first use (Habituation – die Gewöhnung; terapia de reentrenamiento del tinnitus (TRT)); exempt Naluma/tinnitus.
  • Fidelity = factuality: a translation inherits the EN parent's factuality (the published EN already passed); the review skill's factuality is source-fidelity, which holds for faithful translations. We set factuality: 96 (inherited).
  • Structural frontmatter copied verbatim (placement fields can't fork); only text fields re-authored; locale: + status + gate_scores changed.
  • Audio: translate the narration script, keep every pause_after_ms byte-identical (incl. the breathing intro's 3000 prep pause); content.pattern is never translated (loop cues are generated — see Phase 1).
  • Open ES nuance: welcome-to-naluma ES used feminine forms (lista/juntas) where EN was gender-neutral — flag for a native ES review (Jens reviews DE; ES needs a native eye, which the research grounding partially substitutes for).
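
The "keep every pause_after_ms byte-identical" convention lends itself to a mechanical check. The sketch below assumes a narration script is a list of segments, each a dict with "text" and "pause_after_ms"; that shape and the function name are assumptions, not the actual script format.

```python
def pause_mismatches(en_segments: list, loc_segments: list) -> list:
    """Return indices of segments whose pause_after_ms differs from the EN parent."""
    if len(en_segments) != len(loc_segments):
        raise ValueError(
            f"segment count differs: en={len(en_segments)} loc={len(loc_segments)}"
        )
    return [
        i
        for i, (en, loc) in enumerate(zip(en_segments, loc_segments))
        if en["pause_after_ms"] != loc["pause_after_ms"]
    ]
```

An empty list means the translation kept the EN timing; any index returned points at a segment where only the text should have changed.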

Phase 1 + 2 (done, tested)

  • Phase 1 — breathing loop-cue localization (naluma-app-content/audio_pipeline): build_loop_plan(pattern, locale) + per-locale PHASE_WORD/NUMBER_WORD (en/de/es); threaded through the render. 61 audio tests green. fr/pt raise (Phase 5).
  • Phase 2 — non-EN publish unblock (naluma-ai-marketplace): publish finds the EN parent by slug and writes only that locale's translation row; idempotent upsert-on-r2_key (closes #46, pending the demonstration).
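
For orientation, a rough sketch of the Phase-1 idea. Only the signature build_loop_plan(pattern, locale) and the table names PHASE_WORD / NUMBER_WORD come from this log. The pattern shape (a list of (phase, seconds) pairs), the cue words themselves, and the "phase word, then one count word per second" layout are assumptions; the real audio_pipeline code may differ on all three.

```python
# Illustrative cue words only; not copied from the audio pipeline.
PHASE_WORD = {
    "en": {"inhale": "Breathe in", "hold": "Hold", "exhale": "Breathe out"},
    "de": {"inhale": "Einatmen", "hold": "Halten", "exhale": "Ausatmen"},
    "es": {"inhale": "Inhala", "hold": "Mantén", "exhale": "Exhala"},
}
NUMBER_WORD = {
    "en": ["one", "two", "three", "four", "five", "six", "seven", "eight"],
    "de": ["eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben", "acht"],
    "es": ["uno", "dos", "tres", "cuatro", "cinco", "seis", "siete", "ocho"],
}


def build_loop_plan(pattern, locale):
    """Return the spoken cues for one loop: phase word, then one count per second."""
    if locale not in PHASE_WORD:
        # fr/pt have no cue words yet, so they raise (Phase 5).
        raise ValueError(f"no loop cue words for locale {locale!r}")
    cues = []
    for phase, seconds in pattern:
        if seconds > len(NUMBER_WORD[locale]):
            raise ValueError(f"no number word for {seconds} in {locale!r}")
        cues.append(PHASE_WORD[locale][phase])
        cues.extend(NUMBER_WORD[locale][:seconds])
    return cues
```

Because cues are generated from the pattern and the locale tables, content.pattern itself never needs translating, which is the convention recorded above.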

Cache/release subtlety (blocks the live steps in-session)

The Skill tool loads the installed plugin cache (still the old EN-only publish/humanize/review); our changes live in the repo branch. So in this session we ran the tooling directly (review.py --locale de/es) for gating — which worked end-to-end — but could not invoke the updated publish skill. To do the live publishes: merge PR A, bump the plugin version, reinstall, then run the real publish skill. (Alternatively refresh the cache locally — declined as hacky.)

Live run — COMPLETED (after PR A merged + plugin reinstalled to 0.4.0)

  • 8 audio renders done (3 audioGuided × de/es → <locale>.m4a + transcript + manifest; breathing × de/es → {intro,loop,outro}.m4a + manifest). All clean — the Phase-1 DE/ES breathing cues rendered without error (the loop.py localization works end-to-end).
  • All 20 DE/ES translation rows published live to Directus — every session has en+de+es, one parent each, no duplicates. Audio sessions carry per-locale audio_asset rows + transcript (audioGuided) / content.audio (breathing). Verified the path on the first one (the-habituation-model DE, row df0ffbe1): the locale-aware publish (0.4.0) wrote only the de translation row on the existing EN parent 64b224ad, never a second parent.
  • Idempotency demonstrated (closes #46). Re-published welcome-to-naluma DE: re-running r2_upload returned the same r2_key+sha (object unchanged), the asset lookup reused 4fb820b1, and the translation upsert updated row f0e4bfb3 in place. After counts: 1 parent / 1 de row / 1 de.m4a asset — no duplicates. This is the empirical evidence behind #46.
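
The idempotency property demonstrated above, reduced to its core: an upsert keyed on r2_key. This is an in-memory sketch, not the publish skill's Directus code; the function name, the store shape, and the example key are hypothetical.

```python
def upsert_asset(store: dict, r2_key: str, sha: str, locale: str) -> str:
    """Insert or update the asset keyed on r2_key; return 'created' or 'updated'."""
    action = "updated" if r2_key in store else "created"
    # Same key always lands on the same entry, so re-publishing cannot duplicate.
    store[r2_key] = {"sha": sha, "locale": locale}
    return action


# Usage: publishing the same object twice leaves exactly one entry.
assets = {}
first = upsert_asset(assets, "example/welcome/de.m4a", "abc123", "de")   # hypothetical key
second = upsert_asset(assets, "example/welcome/de.m4a", "abc123", "de")
```

Keying on r2_key rather than on a generated row id is what makes the second run an in-place update, matching the "1 parent / 1 de row / 1 de.m4a asset" counts reported above.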

Still open

  • A native Spanish review of the 10 ES translations (esp. the welcome-to-naluma gendered-forms flag) — no native ES reviewer was in the loop; the research grounding is a partial substitute, not a replacement.

Phase-5 implications (the design this pilot feeds)

  1. Locale-aware readability (or drop FK for non-EN) — the one weak spot.
  2. Ship the locale-aware skills — version bump + reinstall flow.
  3. FR + PT — still fully greenfield: no ai-patterns-{fr,pt}, no loop.py cue words (they raise). Same playbook as ES.
  4. Native ES review in the loop (DE is covered by Jens).
  5. A structural-field EN-parity validator (deferred from this pilot as YAGNI; reconsider at scale).
  6. Per-locale voice nuance (e.g. gendered forms) — author guidance + native check.