
Building On-Device Speech Recognition with MLX on Apple Silicon


OnType's on-device speech recognition is powered by mlx-swift-asr, an open-source Swift library we built to run Qwen3-ASR natively on Apple Silicon. No Python runtime, no subprocess bridging, no cloud: just Swift → MLX → Metal, running on the GPU.

This post covers how the inference pipeline works, the latency engineering that makes it feel instant, and the non-obvious pitfalls we hit along the way.

Why Qwen3-ASR, why MLX

Qwen3-ASR is a speech recognition model from Alibaba's Qwen team. We ship the 0.6B-parameter variant, quantized to 6-bit — about 400MB on disk. It uses a Whisper-style audio encoder (Conv2d → Transformer) feeding into a Qwen3 text decoder with Grouped Query Attention, RoPE, and SwiGLU activations.

Apple's MLX framework is purpose-built for ML inference on Apple Silicon. The key advantage over general-purpose runtimes: MLX understands the unified memory architecture of M-series chips. CPU, GPU, and Neural Engine share the same memory pool — no copying data between host and device memory. For a real-time ASR pipeline where audio flows continuously and intermediate representations are consumed immediately, this eliminates an entire class of latency overhead.

We use mlx-swift, the native Swift bindings, so the entire path stays in Apple's ecosystem: Swift → MLX → Metal → GPU. No Python interpreter in the loop.

The inference pipeline

The audio-to-text path in mlx-swift-asr has five phases:

Phase 1: Mel spectrogram

Raw audio at 16kHz is converted to a 128-bin log-mel spectrogram. The parameters match WhisperFeatureExtractor exactly: 400-sample FFT window (25ms), 160-sample hop (10ms), Slaney-style mel filterbank normalization.

The Hann window and mel filterbank matrices are computed once at startup and cached as static properties. This is a small optimization (~10ms saved per transcription), but in a streaming pipeline where every chunk goes through this path, it adds up.

One non-obvious compatibility detail: we drop the last STFT frame to match PyTorch's torch.stft(center=True) behavior. Off-by-one frame count causes dimension mismatches in the encoder — the kind of bug that takes hours to track down because the model doesn't crash, it just outputs garbage.
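The frame-count arithmetic above can be sketched in a few lines. The function name is illustrative; the parameters (400-sample FFT, 160-sample hop, center padding of half a window on each side, drop-last-frame) come from this section:

```swift
// STFT frame count for Whisper-style features.
// Assumptions from the text: nFFT = 400, hop = 160, center padding adds
// nFFT/2 samples on each side, and the last frame is dropped to match
// torch.stft(center=True) frame counts.
func stftFrameCount(samples: Int, nFFT: Int = 400, hop: Int = 160) -> Int {
    let padded = samples + nFFT            // center padding: nFFT/2 per side
    let frames = (padded - nFFT) / hop + 1 // standard STFT frame formula
    return frames - 1                      // drop the last frame
}
```

One second of 16kHz audio (16,000 samples) yields exactly 100 frames, which is the 10ms-per-frame granularity the encoder expects; being off by one here is precisely the dimension-mismatch bug described above.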

Phase 2: Audio encoding

The mel spectrogram feeds into the audio tower — three stride-2 Conv2d layers followed by a Transformer encoder. This compresses the time dimension by ~8x and projects audio features into the text decoder's hidden dimension.

A subtlety: the Conv2d weights need transposing from PyTorch's OIHW layout to MLX's OHWI layout during model loading. The weight sanitization step handles this automatically.
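The remap can be illustrated on a flat buffer. This is a pure-Swift sketch with a hypothetical helper name; the real sanitization step transposes MLX tensors directly:

```swift
// Layout conversion sketch: PyTorch stores Conv2d weights as OIHW
// (out-channels, in-channels, height, width); MLX expects OHWI.
// Pure index remap over a flat buffer.
func oihwToOHWI(_ w: [Float], o: Int, i: Int, h: Int, wd: Int) -> [Float] {
    var out = [Float](repeating: 0, count: w.count)
    for oc in 0..<o {
        for ic in 0..<i {
            for y in 0..<h {
                for x in 0..<wd {
                    let src = ((oc * i + ic) * h + y) * wd + x  // OIHW offset
                    let dst = ((oc * h + y) * wd + x) * i + ic  // OHWI offset
                    out[dst] = w[src]
                }
            }
        }
    }
    return out
}
```

In practice the same effect is a single axis permutation on the loaded tensor; the point is that forgetting it silently scrambles the conv kernels rather than crashing.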

Phase 3: Prompt construction

The encoded audio features are merged into a text prompt following Qwen3's chat template format. Audio placeholder tokens in the prompt get replaced with actual encoder output embeddings using a cumsum-based indexing scheme. This is the same architecture that multimodal LLMs use for image tokens.
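The merge logic reduces to a running count over placeholder positions. A toy sketch under stated assumptions (embeddings as plain `[Float]` rows, hypothetical function and parameter names; the real code operates on MLX tensors):

```swift
// Placeholder-merge sketch: each audio placeholder token in the prompt is
// replaced by the next encoder output embedding, in order. A running count
// (the cumsum) of placeholders seen so far selects the audio row; all other
// positions get ordinary text-token embeddings.
func mergeEmbeddings(promptIDs: [Int], placeholderID: Int,
                     textEmbed: (Int) -> [Float],
                     audioEmbeds: [[Float]]) -> [[Float]] {
    var audioIndex = 0                      // running cumsum of placeholders
    return promptIDs.map { id in
        if id == placeholderID {
            defer { audioIndex += 1 }
            return audioEmbeds[audioIndex]  // next encoder output row
        }
        return textEmbed(id)                // ordinary text embedding
    }
}
```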

Phase 4: Token generation with double-buffering

This is where the interesting performance engineering lives. Autoregressive decoding generates one token at a time. The naive approach — forward pass, extract token, next forward pass — is serial. The GPU computes while the CPU waits, then the CPU extracts the token while the GPU sits idle.

We use a double-buffer asyncEval pattern to overlap GPU and CPU work:

  1. Queue the next forward pass before extracting the current token
  2. Call item() to extract the token ID — this forces a GPU sync, but the GPU is already computing the next logits in parallel
  3. By the time we need the next logits, they're often already materialized

The code reads counterintuitively: you prepare the next step's input embeddings, queue the forward pass and asyncEval, and only then extract the current token. But this pipelining hides the synchronization latency that would otherwise dominate decoding time.
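The ordering can be demonstrated with a toy double-buffered loop. This is a pure-Swift simulation: a serial dispatch queue stands in for the GPU, a deterministic function stands in for the forward pass, and the semaphore marks the sync point that `item()` would force. The real pipeline uses MLX's asyncEval as described above:

```swift
import Dispatch

// Toy "model": deterministic next token from the current one.
// Stands in for a forward pass producing the next logits.
func forwardPass(_ token: Int) -> Int { (token * 31 + 7) % 100 }

// Double-buffered decode loop: queue the next forward pass *before*
// consuming the current token, so both proceed concurrently.
func decode(start: Int, steps: Int) -> [Int] {
    var tokens: [Int] = []
    var current = start
    let gpu = DispatchQueue(label: "gpu")      // stands in for the Metal queue
    for _ in 0..<steps {
        var next = 0
        let done = DispatchSemaphore(value: 0)
        gpu.async {                            // "asyncEval": queue next step
            next = forwardPass(current)
            done.signal()
        }
        tokens.append(current)                 // "item()": consume current token
        done.wait()                            // sync point; work was overlapped
        current = next
    }
    return tokens
}
```

The output is identical to a serial loop; only the schedule changes. In the real decoder the work between queueing and waiting (token extraction, detokenization, stop-condition checks) is what gets hidden behind the GPU's computation.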

Phase 5: Token decoding and cleanup

Generated token IDs are decoded via BPE and run through output parsing that strips special tokens and extracts detected language. The cleaned text then goes through our inverse text normalization engine before reaching the cursor.

Metal warmup: the 5-second cold start

MLX compiles Metal compute kernels at runtime — JIT compilation. The first transcription after app launch incurs ~5 seconds of shader compilation overhead. Subsequent transcriptions are fast.

This is the single most impactful performance pitfall in the entire library. Without warmup, the first real transcription of 8 seconds of audio takes 5–8 seconds, around real-time or slower. With warmup, every transcription runs at 4–6x real-time.

Our warmup strategy has a few non-obvious details:

  • Use noise, not silence. Silence produces near-zero mel values that may not exercise all kernel paths. Low-amplitude random noise ensures diverse computations across the full pipeline.
  • Use realistic audio length. We warm up with 8 seconds of audio so the batched Conv2d encoder processes ~8 chunks, matching realistic batch dimensions. With only 2 seconds (~2 chunks), the first real transcription would still trigger additional Metal pipeline state compilation for larger batches.
  • Use non-zero temperature on first run. With greedy decoding (temperature=0), the model emits EOS immediately on noise input — zero tokens generated, leaving the autoregressive decode loop's kernels uncompiled. Temperature=1.0 forces token generation to exercise the full decode path.
  • Two warmup passes. First with temperature sampling (compile all kernels), second with greedy decoding (validate fast path). Then clear the memory cache but keep the shader cache.
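The warmup input described in the first two bullets can be generated in a few lines (illustrative function name; amplitude is an assumed small value):

```swift
import Foundation

// Warmup audio per the notes above: 8 s of low-amplitude random noise at
// 16 kHz. Noise rather than silence, so every kernel path sees
// non-degenerate values; 8 s rather than 2 s, so the batched Conv2d
// encoder compiles pipeline states for realistic batch dimensions.
func makeWarmupAudio(seconds: Double = 8.0, sampleRate: Int = 16_000,
                     amplitude: Float = 0.01) -> [Float] {
    let count = Int(seconds * Double(sampleRate))
    var rng = SystemRandomNumberGenerator()
    return (0..<count).map { _ in
        Float.random(in: -amplitude...amplitude, using: &rng)
    }
}
```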

Benchmark numbers

Qwen3-ASR 0.6B at 6-bit quantization on an M1 Pro. These are inference-only numbers, excluding model load and Metal warmup:

| Audio duration | Inference time | RTF | Speed |
|---|---|---|---|
| 1.58s | 0.27s | 0.172 | 5.8x real-time |
| 2.56s | 0.41s | 0.159 | 6.3x real-time |
| 0.98s | 0.26s | 0.264 | 3.8x real-time |
| 1.19s | 0.29s | 0.239 | 4.2x real-time |

Typical RTF range: 0.15–0.27 (3.7–6.3x real-time). This is competitive with the Python MLX implementation while running as a native Swift library with no Python overhead.

For OnType's real-time voice typing use case, this means the ASR engine processes speech faster than it arrives. By the time the user releases the hotkey, most of the audio has already been transcribed. Only the final chunk needs processing, which takes under 300ms.

Guardrails

A few safety mechanisms we learned to add the hard way:

  • Duration-based token cap. Maximum generation length is capped at ceil(audioDuration × 20) + 64 tokens. Without this, pathological inputs can cause the model to generate thousands of tokens without emitting EOS, spinning the GPU indefinitely.
  • Repetition detection. If the same token repeats 10 times consecutively, we stop. This catches degenerate outputs where the model gets stuck in a loop.
  • Minimum audio length. Audio shorter than one FFT window (400 samples = 25ms) returns empty immediately. Without this check, the reflection padding in STFT crashes with an invalid range.
  • MLX error conversion. MLX defaults to fatalError for GPU errors. We wrap critical paths in withError to convert these into Swift throws, so the app can show an error in the HUD instead of crashing.
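The first two guardrails are simple enough to sketch directly (hypothetical helper names, formulas taken from the bullets above):

```swift
import Foundation

// Duration-based token cap: ceil(audioDuration * 20) + 64, per the
// guardrail above. Bounds generation even when the model never emits EOS.
func maxTokens(forAudioSeconds duration: Double) -> Int {
    Int((duration * 20).rounded(.up)) + 64
}

// Repetition detection: true once any token repeats `threshold` times
// consecutively, catching degenerate decode loops.
func hasDegenerateRepetition(_ tokens: [Int], threshold: Int = 10) -> Bool {
    guard tokens.count >= threshold else { return false }
    var run = 1
    for i in 1..<tokens.count {
        run = tokens[i] == tokens[i - 1] ? run + 1 : 1
        if run >= threshold { return true }
    }
    return false
}
```

For 8 seconds of audio the cap works out to 224 tokens, generous for real speech (roughly 20–30 spoken tokens per second is already fast) but tight enough to stop a runaway decode within a second or two.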

Open source

mlx-swift-asr is MIT-licensed and available at github.com/ontypehq/mlx-swift-asr. It's a standalone Swift package — add it to your Package.swift and you have on-device speech recognition with a three-line API:

let stt = try await Qwen3ASRSTT.loadWithWarmup(from: modelDirectory)
let result = try await stt.transcribe(file: audioURL)
print(result.text)  // "Hello, world."
print(result.rtf)   // 0.17 (5.8x real-time)

If you're building anything that needs speech recognition on Apple Silicon — voice input, transcription, accessibility tools — give it a try. Benchmarks are reproducible via swift test --filter Benchmark.

Try OnType to see mlx-swift-asr in action — hold a key, speak, release to type. On-device, real-time, private.