Voice Governance · AI Output Evals

Case study · AI · alignment

I engineered a linting framework that strips AI markers and enforces brand personas: evals for prose, not vibes.

The launch team shipped case study pages, the about narrative, and the product hero with zero internal codenames on public URLs. Slug-scoped lint caught regressions on one page without scanning the whole repo. Product docs inherited the we-voice profile and the same codename ban on the next release cycle.

5 voice profiles · contextual lint · founder vs product tone

5 profiles: founder · product · brand layers
6 lint rules: vague · sleaze · AI markers · density
2,021: messages mined for voice gold

Why voice governance matters

People pattern-match AI copy fast, even when they cannot name why. Three low-specificity adjectives in a row, stacked transition words at sentence start, em dashes chaining every bullet: each one costs trust before anyone reads the product spec.

The SaaS launch team came to me on a tight pre-launch window with agent-assisted drafts that saved time and created a new problem. Half the sentences could not ship: internal project names in case study titles, coaching tone on a founder story, hype without a number behind it. Manual re-reads do not scale across a marketing site, a product hero, and multiple case studies shipping together.

The fix is eval infrastructure for prose. Structured profiles define register. TellLint scores every read against density and ban rules. The client team keeps creative control through diff-review; automation catches the regressions that slip through at 11pm before a deploy.

Ship gate pipeline

Figure 1 walks the pipeline left to right. Stage one loads the active voice profile for the surface: founder and case study pages pull I-voice and enterprise-trust density; product docs pull we voice and feature clarity rules. Stage two runs TellLint against the full read or a slug-scoped excerpt when I am editing one case study block.

Stage three emits pass, warn with line references, or block. Blockers include internal codenames on public URLs and AI marker clusters without supporting metrics. Stage four is optional rewrite: a local model rephrases flagged spans while anchors keep facts and numbers intact. Stage five is human diff-review; I re-lint the merged text before anything touches production.
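As a rough sketch, the stage-three verdict could be derived from a list of findings like this. The finding shape (`rule`, `tier`, `line`, `message`), the function name `gateVerdict`, and the choice to let style nudges pass are assumptions for illustration, not the actual TellLint code.

```javascript
// Hypothetical finding shape: { rule, tier, line, message }.
// Tiers follow the article: "block", "warn", "style".
function gateVerdict(findings) {
  const blockers = findings.filter((f) => f.tier === "block");
  if (blockers.length > 0) {
    // Any blocker stops the ship, regardless of other findings.
    return { status: "block", lines: blockers.map((f) => f.line) };
  }
  const warnings = findings.filter((f) => f.tier === "warn");
  if (warnings.length > 0) {
    // Warnings carry line references for diff-review.
    return { status: "warn", lines: warnings.map((f) => f.line) };
  }
  // Assumption: style nudges alone do not hold a ship in this sketch.
  return { status: "pass", lines: [] };
}
```

Keeping the verdict a pure function of the findings means the same check can run before and after a rewrite pass.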

The loop is deliberately boring. Boring gates are what let me ship agent-assisted copy without apologizing for how it reads.

Lint to publish

1 · Profile load, TellLint scan, score/warn, optional rewrite, human review, re-lint, publish.

Lint rule stack

TellLint does not maintain a naive banned-word list. Contextual rejection catches patterns that read wrong for the surface even when each word alone is fine. Figure 2 stacks rules by severity: block ship, warn with line refs, or style nudge.

Block tier stops internal codenames, sandbox filenames, and repo paths on public launch URLs and case study titles. Methodology and outcomes are welcome; product codenames and operator jargon are not. AI marker clusters fire when template transitions pile up: chained openers, hype-word pairs in the same short span, meta disclaimers that explain the copy instead of making a claim.
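A minimal sketch of the codename blocker, assuming a plain list of banned internal names. The list entries, the function name `findCodenames`, and the finding fields are hypothetical; the real ban list and its sourcing are not shown here.

```javascript
// Hypothetical ban list for illustration; the real one is not published.
const BANNED_CODENAMES = ["project-falcon", "sandbox_v2.md"];

// Returns one block-tier finding per banned name per line.
function findCodenames(text, banned = BANNED_CODENAMES) {
  const findings = [];
  text.split("\n").forEach((content, index) => {
    const lower = content.toLowerCase();
    for (const name of banned) {
      // Case-insensitive substring match; a real rule may need word boundaries.
      if (lower.includes(name.toLowerCase())) {
        findings.push({
          rule: "codename",
          tier: "block",
          line: index + 1, // 1-based line ref for diff-review
          message: `internal name "${name}" in public copy`,
        });
      }
    }
  });
  return findings;
}
```

Unlike the contextual rules, this one is a deliberate exact-match list: an internal name is wrong on a public surface in any context.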

Warn tier covers em-dash budget (~1–2 per page typical, max ten per single read), vague stacks (three or more low-specificity adjectives without mechanism), and filler density (ensure, robust, comprehensive repeating in a tight window). Style tier enforces I-voice on founder and case study copy and we voice on product brands.
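The em-dash budget is the easiest warn rule to show as code. The sketch below assumes the cap of ten per read mentioned above; the function name `checkEmDashBudget` and the finding shape are illustrative, not taken from the actual script.

```javascript
const EM_DASH = "\u2014";

// Counts em dashes across the whole read and warns when the budget is exceeded.
function checkEmDashBudget(text, maxPerRead = 10) {
  const hits = []; // one entry per em dash, holding its 1-based line number
  text.split("\n").forEach((content, index) => {
    const count = content.split(EM_DASH).length - 1;
    for (let i = 0; i < count; i++) hits.push(index + 1);
  });
  if (hits.length <= maxPerRead) return [];
  return [
    {
      rule: "em-dash-budget",
      tier: "warn",
      lines: [...new Set(hits)], // distinct lines, in document order
      message: `${hits.length} em dashes, budget ${maxPerRead}`,
    },
  ];
}
```

Passing a smaller `maxPerRead` would approximate the tighter per-page target.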

Sleaze and gimmick detection catches manipulative edu-marketing tone: fake urgency, unearned intimacy, passion without proof. Each rule returns structured findings so rewrite passes target spans instead of rewriting whole pages blind.

No fixed word bans; contextual scoring per profile
Slug-scoped mode lints one case study block during authoring
Site-wide mode guards launch pages, metadata, and support chat responses
Em-dash cap enforced in prose, bullets, and figure captions
Findings include line refs for operator diff-review
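Slug-scoped mode can be sketched as running the rule set over a single block and shifting line references back to file coordinates. The block shape (`slug`, `startLine`, `text`) and the name `lintSlug` are assumptions for illustration; how the real script locates a case study block is not described here.

```javascript
// blocks: [{ slug, startLine, text }], where startLine is the block's
// first line in the full file (1-based). rules: functions text -> findings.
function lintSlug(blocks, slug, rules) {
  const block = blocks.find((b) => b.slug === slug);
  if (!block) throw new Error(`unknown slug "${slug}"`);
  // Only the requested block is scanned; other blocks are never read.
  const findings = rules.flatMap((rule) => rule(block.text));
  // Rules report lines relative to the block; shift them to file lines.
  return findings.map((f) => ({ ...f, line: f.line + block.startLine - 1 }));
}
```

Site-wide mode would then be the same call repeated over every block, which keeps the two modes on one code path.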

Rule tiers

2 · Block, warn, and style tiers: codenames, AI markers, em-dash budget, vague density, I-voice.

Profile hierarchy

Five structured profiles collapse into two registers at lint time. Figure 3 contrasts founder register against product register. Case study pages, about narrative, and founder story use I voice, short direct sentences, and fact-backed outcomes. The bar: prove credibility in sixty seconds without sounding like a template.

Product register covers plugin docs, brand surfaces, and feature marketing with we voice and empathy-forward clarity. A cross-rule layer applies to both: no explaining-the-explanation, no people-person clichés, intensity tied to verifiable facts.

Profiles are JSON contracts, not vibe docs. TellLint loads pronoun rules, density targets, and ban lists before the first scan. When I edit one case study, slug-scoped lint uses the founder profile only on that block so product copy elsewhere does not add noise to the run.
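To make "JSON contract" concrete, here is a hedged sketch of two profiles and a lookup by surface. Every field name, surface key, and value below is hypothetical; the actual profile schema is not published in this write-up.

```javascript
// Hypothetical profile contracts; field names are illustrative only.
const PROFILES = {
  founder: {
    pronoun: "I",
    surfaces: ["case-study", "about", "founder-story"],
    emDashMax: 10,
    bannedCodenames: ["project-falcon"],
  },
  product: {
    pronoun: "we",
    surfaces: ["plugin-docs", "brand", "feature-marketing"],
    emDashMax: 10,
    bannedCodenames: ["project-falcon"],
  },
};

// Picks the profile whose surface list covers the page being linted.
function profileForSurface(surface, profiles = PROFILES) {
  for (const [name, profile] of Object.entries(profiles)) {
    if (profile.surfaces.includes(surface)) return { name, ...profile };
  }
  // Failing loudly avoids linting a page against the wrong register.
  throw new Error(`no voice profile covers surface "${surface}"`);
}
```

For example, `profileForSurface("case-study")` resolves to the founder profile with pronoun `"I"`, while an unmapped surface throws instead of silently falling back.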

Transcript mining fed a gold corpus: two thousand-plus messages parsed for phrasing that reads human under pressure, not phrases models default to when asked to sound professional.

Founder vs product

3 · Founder register (I voice, enterprise-trust) vs product register (we voice, builder clarity).

Before / after launch copy

Figure 4 shows NovaGrid Analytics, a fictional SaaS launch excerpt. The before panel trips five rules in one paragraph: AI markers, vague stack, em-dash chain, second-person coaching, and an internal codename smuggled into public copy.

The after panel keeps the same company but ships launch-ready: I voice, a concrete mechanism (token-budgeted ingest pipeline), countable rules (six lint rules), and an outcome with proof (fewer review rounds, zero codenames on public URLs). That is the rewrite contract: same facts, different register, lint-clean on re-run.

I use panels like this in review meetings when stakeholders ask why we cannot just let the model draft. The answer is not taste; it is eval scores and diff gates.

Eval panel

4 · Fictional excerpt: AI slop flags vs lint-passing I-voice with metrics and mechanism.

Pipeline stages in production

Every ship follows the same sequence: lint, score or warn, optional rewrite, re-lint, human approve. On case study authoring I run slug-scoped lint against the block I am editing so feedback stays local. Before their launch deploy I ran site-wide lint on marketing routes, Open Graph metadata, and the product hero.

The optional rewrite pass calls a local model when the operator machine has inference up. Rewrite prompts include the active profile, flagged spans, and anchor facts that must survive (metrics, shipped URLs, role titles). Output never merges without diff-review; I have seen models swap correct numbers for plausible wrong ones.
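The anchor check can be sketched as a simple survival test: every anchor string must appear verbatim in the rewritten text. The function name `missingAnchors` and the exact-match strategy are assumptions; a real check might normalize whitespace or number formatting first.

```javascript
// Returns the anchor facts that did not survive the rewrite verbatim.
function missingAnchors(rewritten, anchors) {
  return anchors.filter((anchor) => !rewritten.includes(anchor));
}

// Illustrative input: the model rounded a metric while polishing tone.
const dropped = missingAnchors(
  "I mined 2,000 messages and shipped six lint rules.",
  ["2,021", "six lint rules"]
);
// dropped is ["2,021"], so this rewrite should not reach diff-review as-is.
```

An empty result is necessary but not sufficient: it shows the anchors are present, not that the surrounding claim still means the same thing, which is why diff-review stays in the loop.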

Re-lint is non-negotiable after rewrite. Human edits reintroduce em dashes and filler as often as models do. The second pass is cheap compared to an investor bouncing because the about page sounds like ChatGPT.

lint → score/warn → rewrite (optional) → re-lint → ship
Slug-scoped: one case study block during authoring
Site-wide: launch pages, metadata, support chat grounding
Rewrite anchors: metrics, URLs, titles cannot drift
Codename replacement list for operator diff-review

Production proof

Their launch packet is the proof: case study pages, about narrative, and product hero all passed eval gates before the site went live. Prose references figures in order, captions use numbered claims, and slug-scoped lint caught a codename regression before merge. The context architecture case study cross-links here because voice lint modules load on demand when agents touch launch copy; context decides retrieval, voice decides what ships.

Their support chat retrieves case study chunks with the same hybrid ranker I describe in context architecture, then applies founder-register rules to answers so visitors get cited proof without internal names. DevFlow agent IDE sessions load voice lint modules when tasks touch marketing or case study files.

The rewrite queue prioritized P0 launch surfaces first: home hero, about narrative, founder story bullets, then case study slugs. Product brand pages followed with we-voice profiles on the next sprint.

Interview depth & alignment framing

Teams increasingly ask how I eval LLM output without hand-waving. I whiteboard this engagement: profile contracts, rule tiers, slug-scoped vs site-wide lint, rewrite anchors, and human gates. Same vocabulary as RLHF and LLM-as-judge pipelines, applied to prose on URLs stakeholders actually open.

TellLint is the technique name I use for the rule engine; it is not a product brand on public surfaces. The implementation is a Node lint script plus profile JSON, runnable in CI or pre-commit.

Talking points I can defend with code: em-dash budget math, codename ban list sourcing, contextual vague-stack detection without brittle word lists, and why re-lint after human edit matters.

Lessons & what's next

Eval beats editorial prayer. Once lint runs in the ship path, I stop debating whether a sentence feels AI; I read the score and fix line refs.

Slug-scoped lint made case study expansion tractable. Writing long-form proof for a client launch would be reckless without a gate that only scans the block I am editing.

Rewrite without anchors is worse than no rewrite. Models polish tone and accidentally delete the metric that made the sentence credible.

Voice and context are complementary systems. Skipping voice governance on an agent with perfect retrieval still ships codenames; skipping context on perfect lint still ships stale facts.

Next: CI hook on client marketing PRs for launch-copy lint, support chat eval suite so every cited slug passes before the answer renders, and a public extract of the profile schema for teams building their own copy gates.
