Localization Pilot: Learnings Log
Pilot: DE + ES localization of the 10 per-type test sessions.
Spec: docs/superpowers/specs/2026-06-11-localization-pilot-design.md
Plan: docs/superpowers/plans/2026-06-11-localization-pilot.md
Status: Translate + gate complete (20/20 files gate-clean). Audio renders,
live Directus publishes, and the idempotency demo are deferred (await PR A
merge + plugin reinstall — see "Cache/release" below).
This log is the deliverable that feeds the Phase-5 localization design.
Headline finding: the gate stack was English-only, and that was the real gap
The spec assumed an asymmetry of "DE gets the full humanize gate, ES skips it
(no ai-patterns-es)." The truth was bigger: both the review and
humanize skills hard-asserted locale == en and stopped on anything else, and
the deterministic analyzer review.py only carried English pattern data. So
neither DE nor ES could be gated — the German content (ai-patterns-de.md,
voices) existed, but the tooling was never made locale-aware.
We turned that gap into the pilot's main tooling outcome (in
naluma-ai-marketplace, PR feat/publish-non-en-locales):
- ai-patterns-es.md authored (greenfield): European Spanish, informal tú, grounded in research (see below).
- review.py made locale-aware: ported DE + ES banned-pattern data into the locale-keyed dicts (BANNED_VOCAB, ALLOWED, medical-hype, BANNED_CONSTRUCTIONS, OPENERS, cliches) verbatim from the docs, lowercased; +8 tests (DE+ES), 23 green.
- humanize + review SKILL.md de-gated: supported locales are now en/de/es; they stop only on locales with no ai-patterns doc (fr/pt).
Net: the spec's DE-vs-ES asymmetry is gone — both DE and ES now have full, gated support. The pilot upgraded the tooling instead of working around it.
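A minimal sketch of what locale-keyed pattern data and a locale-aware lookup can look like. Only the name BANNED_VOCAB comes from this log; the dict shape, the `find_banned` helper, and the sample entries are assumptions for illustration (the ES words are taken from the research notes in this log, the EN and DE entries are placeholders), not the actual review.py contents.

```python
import re

# Hypothetical shape: locale -> set of lowercased banned words.
BANNED_VOCAB = {
    "en": {"delve", "tapestry"},                 # placeholder entries
    "de": {"zweifellos"},                        # placeholder entry
    "es": {"destacar", "enfatizar", "resonar"},  # listed in the research notes
}

SUPPORTED_LOCALES = frozenset(BANNED_VOCAB)


def find_banned(text: str, locale: str) -> list[str]:
    """Return banned words found in text, in order of first appearance."""
    if locale not in SUPPORTED_LOCALES:
        # Mirrors the behaviour described above: stop on locales
        # that have no pattern data (fr/pt in this pilot).
        raise ValueError(f"no pattern data for locale {locale!r}")
    banned = BANNED_VOCAB[locale]
    hits: list[str] = []
    # Lowercase the text because the pattern data is stored lowercased.
    for word in re.findall(r"\w+", text.lower()):
        if word in banned and word not in hits:
            hits.append(word)
    return hits
```

Keying every table by locale means an unsupported locale fails loudly at lookup time instead of silently falling back to the English data.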
Research grounding (per the user's mid-pilot request)
Are there multilingual AI-pattern sources? Yes:
- Wikipedia "Signs of AI writing" is English-only; no DE/ES editions exist. (It's the basis of the /humanizer skill.)
- Liang et al., "AI-Associated Lexical Shifts Across 34 Languages" (arXiv 2605.25358, 2026): the strongest source; empirically ranks AI-overused words per language, incl. DE/ES. Key insight: overuse converges by concept, not literal word (LLMs over-favour "emphasising / importance / innovation" in each language's own vocabulary).
- El Economista (2024) — concrete ES frequency data (crucial ~6,400× more frequent in AI text; fundamental, esencial, desafíos, resonar, dinámico…).
ai-patterns-es.md was re-grounded in these (added destacar, enfatizar, realzar,
resonar, dinámico, fundamental, invaluable, elevar; openers En conclusión / En el
contexto actual; fixed a self-inflicted subrayar → destacar bug since destacar
is itself AI-flagged). The DE doc was left as-is (user decision) — note several
study-flagged German words (Bedeutung, Notwendigkeit) are everyday nouns that
would over-fire if hard-banned; only verb/filler tells are safe to add.
What the gate caught (it earns its keep)
All 20 translations ended gate-clean, but not all started clean:
- Em dashes: reframe DE had 4, caught and fixed (→ deterministic 100). German uses the en-dash Gedankenstrich (–, U+2013) legitimately; the analyzer counts only em dashes (—, U+2014), which is the right call here.
- Description caps (≤120): DE/ES run longer than EN. how-habituation DE was 141→117, ES 130→111; three-minute-breathing-space ES 123→108. The cap is a real constraint for localization: literal translations routinely overshoot. (Enforced at publish by validate.py, not by review.py, so the translator must check it.)
- A session.<locale>.md authoring trap: a trailing --- after the body makes the YAML parse as two documents and the gate errors (expected a single document). Found in the first hand-authored DE file; baked into the subagent conventions thereafter.
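The three catches above can be approximated with a few standard-library checks. This is a sketch only: the function names are hypothetical, the cap is assumed to be measured in characters, and the file is assumed to be `---`-fenced frontmatter followed by a body. The real review.py and validate.py may count and parse differently.

```python
EM_DASH = "\u2014"   # counted as an AI tell
EN_DASH = "\u2013"   # German Gedankenstrich, deliberately not counted
DESCRIPTION_CAP = 120


def count_em_dashes(text: str) -> int:
    """Count only U+2014; en dashes (U+2013) are left alone."""
    return text.count(EM_DASH)


def description_overshoot(description: str, cap: int = DESCRIPTION_CAP) -> int:
    """Characters over the cap (0 when the description fits)."""
    return max(0, len(description) - cap)


def has_trailing_separator(source: str) -> bool:
    """True if the last non-blank line is a bare '---' beyond the
    two fences that open and close the frontmatter."""
    lines = [ln.rstrip() for ln in source.splitlines() if ln.strip()]
    return len(lines) > 2 and lines[-1] == "---" and lines.count("---") > 2
```

Running the overshoot check before gating is cheap, which matters here because the cap is only enforced at publish time.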
Known limitation: readability is English-calibrated
review.py's fk_grade (Flesch-Kincaid via textstat) is English-tuned. DE/ES
consistently exceed the 8.0 cap — ES especially (9–14) due to longer inflected
words — even on clean, well-written prose. This pulled deterministic_score down
(ES floors ~79–94) without indicating a real style problem. Phase-5: make
readability locale-aware (textstat language config, or a per-locale cap, or drop FK
for non-EN). The banned-pattern checks (the meaningful signal) ported cleanly and
are reliable.
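To make the effect concrete, here is a sketch of the standard Flesch-Kincaid grade formula next to one of the Phase-5 options (a per-locale cap, with the check skipped where no cap is set). The 8.0 EN cap is from this log; the `FK_CAP` table, the `per_point` weight, and both function names are hypothetical and do not reflect how review.py computes deterministic_score.

```python
from typing import Optional

# Hypothetical policy table: None means "skip the FK check for this locale".
FK_CAP: dict[str, Optional[float]] = {"en": 8.0, "de": None, "es": None}


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard FK grade: rises with sentence length and syllables per word."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def fk_penalty(fk_grade: float, locale: str, per_point: float = 2.0) -> float:
    """Score deduction for exceeding the locale's cap (0.0 when skipped).

    Locales missing from FK_CAP are also skipped in this sketch.
    """
    cap = FK_CAP.get(locale)
    if cap is None:
        return 0.0
    return max(0.0, fk_grade - cap) * per_point
```

With sentence length held at 10 words, moving from 1.5 to 2.0 syllables per word lifts the grade by 5.9 points (11.8 × 0.5), which is consistent with longer inflected words pushing otherwise clean ES prose past the cap.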
Translation conventions established (carry to FR/PT + future content)
- Register: German informal du; European Spanish informal tú (parallels the brand's warm, non-clinical voice).
- Plain-before-clinical: gloss glossary terms on first use (Habituation – die Gewöhnung; terapia de reentrenamiento del tinnitus (TRT)); exempt Naluma/tinnitus.
- Fidelity = factuality: a translation inherits the EN parent's factuality (the published EN already passed); the review skill's factuality is source-fidelity, which holds for faithful translations. We set factuality: 96 (inherited).
- Structural frontmatter copied verbatim (placement fields can't fork); only text fields re-authored; locale: + status + gate_scores changed.
- Audio: translate the narration script, keep every pause_after_ms byte-identical (incl. the breathing intro's 3000 prep pause); content.pattern is never translated (loop cues are generated; see Phase 1).
- Open ES nuance: welcome-to-naluma ES used feminine forms (lista/juntas) where EN was gender-neutral. Flag for a native ES review (Jens reviews DE; ES needs a native eye, which the research grounding partially substitutes for).
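The "keep every pause_after_ms identical" convention is easy to spot-check by hand with something like the sketch below. It assumes the narration script is a list of segments, each a dict with a `pause_after_ms` key; that shape and the `pause_mismatches` helper are hypothetical, and this is not the structural-field parity validator that the pilot deferred.

```python
def pause_mismatches(en_segments: list[dict], loc_segments: list[dict]) -> list[int]:
    """Return the indices of segments whose pause differs from the EN parent."""
    if len(en_segments) != len(loc_segments):
        raise ValueError("segment count differs from the EN parent")
    return [
        i
        for i, (en, loc) in enumerate(zip(en_segments, loc_segments))
        if en.get("pause_after_ms") != loc.get("pause_after_ms")
    ]


# Example data (invented for illustration).
en = [
    {"text": "Get comfortable.", "pause_after_ms": 3000},
    {"text": "Breathe in.", "pause_after_ms": 1500},
]
de = [
    {"text": "Mach es dir bequem.", "pause_after_ms": 3000},
    {"text": "Atme ein.", "pause_after_ms": 1500},
]
```

An empty list from `pause_mismatches(en, de)` means only the text changed, which is the intended outcome for a translated narration script.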
Phase 1 + 2 (done, tested)
- Phase 1 (breathing loop-cue localization, naluma-app-content/audio_pipeline): build_loop_plan(pattern, locale) + per-locale PHASE_WORD / NUMBER_WORD (en/de/es); threaded through the render. 61 audio tests green. fr/pt raise (Phase 5).
- Phase 2 (non-EN publish unblock, naluma-ai-marketplace): publish finds the EN parent by slug and writes only that locale's translation row; idempotent upsert-on-r2_key (closes #46, pending the demonstration).
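For orientation, a rough sketch of how a locale-aware loop plan might be assembled. The names build_loop_plan, PHASE_WORD, and NUMBER_WORD come from this log; everything else is assumed for illustration, including the pattern format (a list of phase/count pairs), the return type (a flat list of spoken cues), and the cue wording itself. The real audio_pipeline code may differ on all three.

```python
# Illustrative cue tables; the actual wording in the pipeline is not shown in this log.
PHASE_WORD = {
    "en": {"inhale": "Breathe in", "hold": "Hold", "exhale": "Breathe out"},
    "de": {"inhale": "Einatmen", "hold": "Halten", "exhale": "Ausatmen"},
    "es": {"inhale": "Inhala", "hold": "Mantén", "exhale": "Exhala"},
}
NUMBER_WORD = {
    "en": ["one", "two", "three", "four"],
    "de": ["eins", "zwei", "drei", "vier"],
    "es": ["uno", "dos", "tres", "cuatro"],
}


def build_loop_plan(pattern: list[tuple[str, int]], locale: str) -> list[str]:
    """Expand (phase, count) pairs into spoken cues for one loop cycle."""
    if locale not in PHASE_WORD:
        # fr/pt have no cue words yet, so they raise.
        raise NotImplementedError(f"no loop cue words for locale {locale!r}")
    numbers = NUMBER_WORD[locale]
    plan: list[str] = []
    for phase, count in pattern:
        if count > len(numbers):
            raise ValueError(f"no number word for count {count}")
        plan.append(PHASE_WORD[locale][phase])
        plan.extend(numbers[:count])
    return plan
```

Because the cues are generated from the pattern, the pattern itself never needs translating; only the two word tables grow when a locale is added.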
Cache/release subtlety (blocks the live steps in-session)
The Skill tool loads the installed plugin cache (still the old EN-only
publish/humanize/review); our changes live in the repo branch. So in this
session we ran the tooling directly (review.py --locale de/es) for gating —
which worked end-to-end — but could not invoke the updated publish skill. To do
the live publishes: merge PR A, bump the plugin version, reinstall, then run the
real publish skill. (Alternatively refresh the cache locally — declined as hacky.)
Live run: COMPLETED (after PR A merged + plugin reinstalled to 0.4.0)
- 8 audio renders done (3 audioGuided × de/es → <locale>.m4a + transcript + manifest; breathing × de/es → {intro,loop,outro}.m4a + manifest). All clean: the Phase-1 DE/ES breathing cues rendered without error (the loop.py localization works end-to-end).
- All 20 DE/ES translation rows published live to Directus: every session has en+de+es, one parent each, no duplicates. Audio sessions carry per-locale audio_asset rows + transcript (audioGuided) / content.audio (breathing). Verified the path on the first one (the-habituation-model DE, row df0ffbe1): the locale-aware publish (0.4.0) wrote only the de translation row on the existing EN parent 64b224ad, never a second parent.
- Idempotency demonstrated (closes #46). Re-published welcome-to-naluma DE: re-running r2_upload returned the same r2_key + sha (object unchanged), the asset lookup reused 4fb820b1, and the translation upsert updated row f0e4bfb3 in place. After counts: 1 parent / 1 de row / 1 de.m4a asset, no duplicates. This is the empirical evidence behind #46.
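The idempotency property demonstrated above can be modelled in a few lines with an in-memory store. This is a toy sketch, not the publish skill: the `publish_translation` function, the r2_key layout, the id format, and keying translation rows on (parent, locale) are all assumptions made for illustration; only the idea of looking assets up by r2_key and updating the translation row in place comes from this log.

```python
import hashlib


def publish_translation(store: dict, parent_id: str, locale: str,
                        title: str, audio: bytes) -> tuple[str, str]:
    """Upsert one locale's asset and translation row; return (row_id, asset_id)."""
    r2_key = f"sessions/{parent_id}/{locale}.m4a"   # hypothetical key layout
    sha = hashlib.sha256(audio).hexdigest()

    # Asset: look up by r2_key and reuse it if it already exists.
    asset = store["assets"].get(r2_key)
    if asset is None:
        asset = {"id": f"asset-{len(store['assets']) + 1}", "r2_key": r2_key}
        store["assets"][r2_key] = asset
    asset["sha"] = sha

    # Translation row: one per (parent, locale), updated in place on re-publish.
    row_key = (parent_id, locale)
    row = store["rows"].get(row_key)
    if row is None:
        row = {"id": f"row-{len(store['rows']) + 1}"}
        store["rows"][row_key] = row
    row.update(title=title, audio_asset=asset["id"])
    return row["id"], asset["id"]
```

Publishing the same locale twice against the same store returns the same ids and leaves one row and one asset, which is the same shape as the "after counts" check used in the live demonstration.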
Still open
- A native Spanish review of the 10 ES translations (esp. the welcome-to-naluma gendered-forms flag): no native ES reviewer was in the loop; the research grounding is a partial substitute, not a replacement.
Phase-5 implications (the design this pilot feeds)
- Locale-aware readability (or drop FK for non-EN) — the one weak spot.
- Ship the locale-aware skills — version bump + reinstall flow.
- FR + PT are still fully greenfield: no ai-patterns-{fr,pt}, no loop.py cue words (they raise). Same playbook as ES.
- Native ES review in the loop (DE is covered by Jens).
- A structural-field EN-parity validator (deferred from this pilot as YAGNI; reconsider at scale).
- Per-locale voice nuance (e.g. gendered forms) — author guidance + native check.