
From Spoken Words to Clean Text: How Inverse Text Normalization Works


Speech recognition models output words. But when you dictate "the meeting is on January fifteenth at two thirty PM and the budget is three thousand dollars", you don't want to see those exact words. You want: "The meeting is on January 15th at 2:30 PM and the budget is $3,000."

The transformation from spoken form to written form is called inverse text normalization, or ITN. It's the inverse of the text normalization step in a text-to-speech front end (which expands written text into its spoken form). And it's one of those features that's invisible when it works and infuriating when it doesn't.

What ITN covers

ITN handles a broader range of transformations than most people realize:

  • Numbers. "forty two" → "42", "three point one four" → "3.14", "negative seven" → "-7"
  • Currency. "three thousand dollars" → "$3,000", "fifty euros" → "€50"
  • Dates and times. "January fifteenth twenty twenty six" → "January 15th, 2026", "two thirty PM" → "2:30 PM"
  • Ordinals. "the third item" → "the 3rd item"
  • Units. "five kilometers" → "5 km", "twenty degrees celsius" → "20°C"
  • Punctuation. "comma" → ",", "period" → ".", "question mark" → "?"
  • Voice commands. "new line" → actual newline, "colon" → ":"

Each of these seems simple in isolation. The complexity comes from ambiguity. Does "one" mean the number 1, or the pronoun? Does "May" mean the month, or the modal verb? Does "dash" mean a hyphen, or the word "dash"?
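To make the number case concrete, here's a minimal Python sketch of spoken-cardinal conversion. This is an illustration of the transformation, not the FST-based production engine described below, and it sidesteps the ambiguity problem by assuming every token is part of a number:

```python
# Spoken-cardinal conversion sketch: "forty two" -> 42.
# Assumes the input is known to be a number phrase.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16,
         "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
SCALES = {"thousand": 1_000, "million": 1_000_000}

def spoken_to_int(words: str) -> int:
    total, current = 0, 0
    for w in words.lower().replace("-", " ").split():
        if w in UNITS:
            current += UNITS[w]
        elif w in TEENS:
            current += TEENS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current *= 100
        elif w in SCALES:  # "thousand"/"million" close out a group
            total += current * SCALES[w]
            current = 0
        elif w == "and":   # "one hundred and five"
            continue
        else:
            raise ValueError(f"unknown token: {w}")
    return total + current
```

Even this toy version hints at the real difficulty: it only works because we've already decided the phrase *is* a number, which is exactly the disambiguation a production engine has to do on its own.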

Why we built our own engine

Most ASR providers include basic ITN in their cloud pipeline. But when you're doing on-device processing, you need on-device ITN too — and the options are limited. Existing open-source ITN libraries are typically Python-based, designed for batch processing, and focused on English.

We needed something different:

  • Real-time performance. ITN runs on every chunk of streaming transcription. It needs to process text in microseconds, not milliseconds.
  • CJK support. Chinese and Japanese have completely different number systems, punctuation conventions, and formatting rules. "三千美元" needs to become "$3,000" just like "three thousand dollars" does.
  • Native integration. We need a library that runs in a Swift macOS app without bundling a Python runtime or bridging through a subprocess.

Finite state transducers

Our ITN engine is built on finite state transducers, or FSTs. An FST is a state machine that reads an input sequence and produces an output sequence. For ITN, the input is a sequence of spoken words and the output is the normalized written form.

The key advantage of FSTs over regex or rule-based string replacement is composability. You can build small FSTs for individual transformations — one for cardinal numbers, one for dates, one for currency — and compose them into a single transducer that handles all cases simultaneously, with deterministic priority ordering when rules overlap.
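To show the mechanics, here's a toy transducer in Python: each arc consumes one spoken token and emits a piece of written output. Real FST runtimes work on compiled, optimized graphs; this sketch only illustrates the input/output behavior, with states and arcs chosen for the example:

```python
# Toy finite state transducer: arcs map (state, input token) to
# (next state, output fragment). Illustration only.
class FST:
    def __init__(self):
        self.arcs = {}      # arcs[state][token] = (next_state, output)
        self.finals = set() # accepting states

    def add_arc(self, src, token, dst, out):
        self.arcs.setdefault(src, {})[token] = (dst, out)

    def transduce(self, tokens):
        state, out = 0, []
        for tok in tokens:
            try:
                state, emit = self.arcs[state][tok]
            except KeyError:
                return None  # no matching path: leave text untouched
            out.append(emit)
        return "".join(out) if state in self.finals else None

# A three-arc transducer for "two thirty PM" -> "2:30 PM".
time_fst = FST()
time_fst.add_arc(0, "two", 1, "2")
time_fst.add_arc(1, "thirty", 2, ":30")
time_fst.add_arc(2, "PM", 3, " PM")
time_fst.finals.add(3)
```

The payoff of the FST formalism is that many small machines like `time_fst` can be merged into one graph, so a single left-to-right pass over the token stream applies every rule at once.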

We wrote the FST runtime library, libfst, in Zig. Zig gives us a few things that matter for this use case: C ABI compatibility (so the library links directly into the Swift app), zero-cost abstractions for the state machine transitions, and precise control over memory allocation — no garbage collector pauses in the middle of real-time text processing.

The rule compilation pipeline

ITN rules are authored in Python as declarative transformation specifications. Each rule describes a pattern (input words) and a replacement (output text). The Python toolchain compiles these rules into binary FST files — compact, optimized transducer representations that the Zig runtime loads and executes.

This split lets us iterate on rules quickly — Python is great for expressing linguistic patterns — while keeping the runtime fast. Adding a new number format or currency symbol means editing a Python rule file and recompiling. The runtime binary doesn't change.
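The actual rule DSL isn't shown here, but a declarative rule in this style might look like the following sketch. The names (`Rule`, `compile_rules`) and the structure are illustrative assumptions, not the real toolchain:

```python
# Hypothetical declarative rule specification, sketched after the
# pattern/replacement/priority design described above.
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    pattern: list       # spoken tokens to match
    replacement: str    # written-form output
    priority: int = 0   # higher wins when rules overlap

RULES = [
    Rule("newline_cmd", ["new", "line"], "\n", priority=10),
    Rule("comma_cmd",   ["comma"],       ",",  priority=10),
    Rule("pm_time",     ["two", "thirty", "PM"], "2:30 PM", priority=5),
]

def compile_rules(rules):
    """Stand-in for the real compiler: order rules so higher-priority
    interpretations win in ambiguous positions, then (in the real
    pipeline) serialize them into a binary FST the runtime loads."""
    return sorted(rules, key=lambda r: -r.priority)
```

The point of the split survives even in this sketch: the rule data changes and recompiles; the runtime that walks the resulting transducer does not.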

Currently we ship compiled FST rule files for Chinese, Japanese, and English. Each language has its own set of rules because the normalization conventions differ substantially. Japanese uses full-width punctuation and different counter words. Chinese numbers follow a base-10,000 grouping instead of base-1,000. English has its own quirks with ordinals and twelve-hour time.
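The base-10,000 difference is easiest to see in code. In Chinese, 万 (10⁴) and 亿 (10⁸) close out groups the way "thousand" and "million" do in English, while 十/百/千 build up within a group. A simplified sketch (illustration only, ignoring variants like 两 and fully colloquial forms):

```python
# Chinese numeral reading with base-10,000 grouping. Sketch only.
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL = {"十": 10, "百": 100, "千": 1_000}       # within-group multipliers
GROUP = {"万": 10_000, "亿": 100_000_000}        # group separators

def chinese_to_int(text: str) -> int:
    total, group, num = 0, 0, 0
    for ch in text:
        if ch in DIGITS:
            num = DIGITS[ch]
        elif ch in SMALL:
            group += (num or 1) * SMALL[ch]      # 十 alone means 10
            num = 0
        elif ch in GROUP:
            total += (group + num) * GROUP[ch]   # close out the group
            group = num = 0
        else:
            raise ValueError(f"unknown character: {ch}")
    return total + group + num
```

So 三千 reads as 3,000 but 三万 jumps straight to 30,000; there is no single character for "million," which is why English-style base-1,000 rules can't simply be transliterated.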

Voice commands

ITN also handles voice commands — spoken phrases that map to keyboard actions rather than literal text. When you say "new line" during dictation, you want an actual line break, not the words "new line." When you say "comma," you want the punctuation mark.

This creates an interesting disambiguation challenge. The ITN engine needs to determine whether "new line" is a voice command or part of a sentence like "this is a new line of products." We handle this through context: voice commands are recognized in specific syntactic positions (typically after a pause or at clause boundaries), and the FST's priority ordering ensures that command interpretations are preferred in ambiguous positions.
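The context heuristic can be sketched in a few lines of Python. The boundary test and the `<pause>` marker here are illustrative assumptions, not the production logic, which lives in the FST's priority ordering:

```python
# Sketch: treat "new line" as a command only at a clause boundary
# (utterance start, or right after punctuation / a pause marker).
COMMANDS = {("new", "line"): "\n"}

def at_clause_boundary(tokens, i):
    return i == 0 or tokens[i - 1] in {".", ",", "?", "<pause>"}

def apply_commands(tokens):
    out, i = [], 0
    while i < len(tokens):
        for pattern, emit in COMMANDS.items():
            if (tuple(tokens[i:i + len(pattern)]) == pattern
                    and at_clause_boundary(tokens, i)):
                out.append(emit)       # command interpretation
                i += len(pattern)
                break
        else:
            out.append(tokens[i])      # literal words
            i += 1
    return out
```

With this heuristic, "new line" at the start of an utterance becomes a line break, while "a new line of products" mid-sentence stays literal.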

For CJK languages, voice commands are localized. Chinese users say "换行" for newline and "逗号" for comma. The FST rules are language-specific, so each locale has its own natural command vocabulary.

The invisible feature

Good ITN is invisible. You dictate naturally and the text looks right. Numbers are numbers, dates are dates, punctuation is where it should be. The moment you have to manually correct a "$three thousand" or delete the literal words "new line" from your text, the illusion breaks.

We put significant engineering effort into a feature that, when it's working perfectly, nobody notices. That's the point. The best voice typing experience is the one where you forget you're not typing.

Try OnType — on-device voice typing with smart text normalization for English, Chinese, and Japanese.