From RMS thresholds to a rhythm-aware state machine

Detecting breath phases sounds easy — until the user actually breathes.

Four months ago we shipped real-time audio biofeedback on iOS. A native Swift plugin, a clean RMS envelope, a threshold, a phase. The first engineering post told that story: how we bypassed WKWebView's broken Web Audio bridge with AVAudioEngine and open-sourced the result.

That was the easy part.

This post is about the harder part that followed: teaching the same app to reliably hear five different breathing techniques — coherence, 4-7-8, box breathing, the energy-breathing pattern, and the physiological sigh — across two audio themes, in kitchens with running dishwashers, in offices with ventilation fans, with nose-breathers and mouth-breathers, people who exhale like a whisper and people who exhale like a steam engine.

It's the story of how a system that "mostly works" became one that works the same way for every technique, for every user, every time. And of where machine learning quietly fits in.

The Setup

shii·haa's guided sessions are personalized. You breathe a technique freely first, the app measures your natural tempo, and then the guided breath-work adapts its absolute timings to your body while preserving the technique's ratio. A 4-7-8 session for you might become 3.2 s inhale / 5.6 s hold / 6.4 s exhale — the ratio is preserved, the tempo is yours.

For that to work, the detector must answer one question reliably:

"Is the user currently inhaling, exhaling, holding, or paused — right now?"

Four states. Five techniques. Two audio themes (a classical EQ and a softer "Universum" theme with different spectral tilt). Ambient noise. Microphones ranging from AirPods to a phone on a yoga mat.

Our first version solved this the obvious way: RMS envelope + a fixed threshold. Above threshold → active (inhale or exhale). Below → silent (hold or pause). A state machine cycled through the technique's phases in order.

For coherence breathing — a gentle, symmetric 5.5 breaths per minute — it was beautiful.

For everything else it was brittle.

What Broke

Three failure modes, captured on video, taught us what was actually wrong.

Failure 1 — The Quiet Exhale Tail

With a microphone 30 cm away, a relaxed exhale tapers off into a tail that's 6–8 dB quieter than the inhale peak. A single symmetric threshold th would catch the inhale cleanly and then drop the last 30 % of the exhale below threshold — the state machine would see silent, advance to "hold", and the real exhale would keep going into the next phase.

We fixed this with an asymmetric threshold. thHigh gates phase entry, thLow (about 65 % of thHigh) gates phase exit. Once you're inside an active phase, you stay there until the signal crosses the lower bar. This is standard Schmitt-trigger hysteresis — but you only realise you need it when you watch a real user's exhale die quietly under the mic.

Failure 2 — The Physiological Sigh That Never Finished a Cycle

The physiological sigh has three phases with ratio [1, 0.5, 3] — a short inhale, a shorter "top-up" inhale, and a long relaxed exhale. Our generic formula for the minimum valid cycle duration multiplied base-tempo × ratio-sum × safety-factor and came out at 6.75 seconds.

But a real physiological sigh takes 4–7 seconds. Every single cycle the detector completed was discarded by the guard if (cycleMs < techMinCycleMs) return;. The cycle counter sat at zero while the state machine, visibly, worked.

The fix was simple in code (a special case for the ein+ phase) and humbling in principle: you cannot derive a minimum cycle duration from the ratio alone. Physiology dictates the absolute floor. The ratio just tells you the shape.

Failure 3 — The State That Refused to Move

Box breathing. Continuous mouth breathing in the energy pattern (3:2 inhale/exhale, no hold). The signal was perfect — band energy 4.4, smoothed envelope 2.1, clearly above threshold. And yet the state machine sat in the "silent after inhale" slot and refused to advance to "exhale" for three, four, sometimes five seconds.

The bug was subtle. To move from an active phase back to silence, our machine required ACTIVE_CONFIRM_MS (600 ms) of uninterrupted above-threshold signal. With continuous mouth breathing the signal briefly dips between inhale and exhale for maybe 80 ms. That micro-dip reset the confirmation timer. The state machine would never accumulate 600 ms of cleanness in a row, so it would never trust that a phase change had happened.

This one required a real architectural shift.

The Architectural Shift: From Reactive Threshold to Rhythm-Aware State Machine

The deepest thing we learned is this: thresholds describe sound; rhythm describes breath. A detector that only reacts to instantaneous amplitude will always lag, misclassify, or stall when the signal is noisy. A detector that builds expectations from rhythm — where the valleys should be, how long a phase should plausibly last — can commit decisions earlier and more confidently.

Three changes followed from that insight.

1. The Prep Phase as Synchronization Anchor

Instead of dropping users straight into a free-breathing session, the detector now inserts a 3-second prep phase before active detection starts. The UI says "Atme noch einmal ganz aus…" and the app uses those 3 seconds to:

Measure the last exhale peak and its spectral centroid — a per-user, per-session acoustic fingerprint stored as _userExhaleCentroid and used downstream to disambiguate breath from speech or ambient noise.
Establish a deterministic mapping: active1 = inhale, active2 = exhale. No more guessing which direction the first phase should be.

Prep is enabled for coherence, 4-7-8, box, and energy breathing. It's skipped for the physiological sigh because starting with a "last exhale" would be physiologically nonsense (a sigh begins with inhale on top of inhale).

The UX cost is 3 seconds. The technical benefit is enormous: every downstream gate can assume the first active phase is an inhale, full stop.

2. The Valley-Rhythm Gate

For techniques without holds — coherence, the active phases of the energy pattern — we replaced "change phase when signal crosses threshold" with "change phase when:

the envelope shows a local minimum (a valley), and
at least 60 % of the expected phase duration has elapsed."

The expected phase duration comes from the user's own measured tempo during the free-breathing segment, multiplied by the technique's ratio. A user with a natural 3.2 s inhale doing coherence gets a gate that won't fire for the first 1.9 s regardless of signal wobble. After 1.9 s, the first real valley commits the transition.

This one change eliminated a whole class of "phase flaps" we had been papering over with longer smoothing windows.

3. The Valley-Rescue for Stuck States

For the energy-breathing failure mode (state stuck in silent1), the valley detector pulls double duty. If the machine is sitting in silent1 or silent2, we are past the active phase (the user is clearly in the transition), and the active phase reached ≥ 60 % of its target duration, a valley triggers an immediate rescue transition. No 600 ms confirmation required. The signal has already told us what happened; we just trust it.

Valley-rescue is gated on !techHasHold — for 4-7-8 and box breathing, where silent phases are supposed to last a while, we never rescue. You don't want the app to skip the 7-second hold in 4-7-8 because the microphone is too quiet.

A Small Code Tour

The valley-rhythm gate, stripped to its essence:

// During active phases: only commit a transition at a valley,
// and only if we've actually completed enough of the phase.
if (isActive && valley && !techHasHold) {
  const phaseElapsed = now - _phaseStartMs;
  const minPhaseDuration = _targetPhaseMs * 0.60;
  if (phaseElapsed >= minPhaseDuration) {
    advancePhase();        // commit
  }
  // otherwise: it's a valley, but too early — ignore.
}

The prep phase, emitting the synchronization anchor:

// prep state, last 300ms before free-breathing begins
if (_state === 'prep' && prepTimeLeft < 300) {
  if (envelope > peakSoFar) {
    peakSoFar = envelope;
    _userExhaleCentroid = currentCentroid;  // spectral fingerprint
  }
}

The physiological-sigh minimum cycle guard:

// Generic formula: techMinCycleMs = baseTempo * ratioSum * 1.5
// But physiology is not generic.
if (techHint.phases.indexOf("ein+") >= 0) {
  techMinCycleMs = 2500;  // floor for sighs — 4-7s typical, 2.5s absolute min
}

The actual implementation in biofeedback.js (now at version 394) is longer and has more edge cases. The shape of the ideas is what matters.

Where Machine Learning Fits — Honestly

We have a machine learning module in production. We also do not yet let it drive decisions. Both of those facts are intentional — and the reasoning became the foundation for how we plan to publish this work.

The two-stage framing. The rule-based detector is not just the shipping product. It is also, deliberately, a qualified labeling substrate for the ML that comes next. We first validate the detector against a clinical reference. Then — and only then — we use the labels it produces to train a personalized model, on a dataset whose label quality we have already measured. This inverts the standard "unvalidated labels in, unvalidated model out" loop that limits most mobile-audio ML work. Details in the research pitch (PDF, German).

What exists today:

breath-ml.js, an adaptive module that runs an AudioWorklet at 16 kHz, window size 1 s, hop 10 ms, mel-spectrogram parameters matching YAMNet (64 mel bands, 60–7800 Hz).
Real 1024-dimensional YAMNet embeddings extracted per second and stored in IndexedDB, labelled inhale / exhale / pause / noise.
A trainable classification head — Dense(1024 → 64, swish) → Dropout(0.3) → Dense(64 → 4, softmax) — trained on the user's own data once a minimum sample count per class is reached.
A quality-tiered label system: Gold (BLE-HRV-confirmed + high confidence + plausible), Silver (plausible + confidence > 60 % + phase > 1 s), Bronze (used only for diagnostics). Training uses Gold + Silver.
An A/B shadow evaluator that runs the ML classifier alongside the rule-based detector in real time, logs agreement and disagreement, and never — today — overrides the state machine.

The reason ML is shadow-only right now is empirical and humble: the rule-based detector, after the fix-chain described above, reaches 94–99 % cycle regularity in our field tests. A classifier that sometimes agrees and sometimes disagrees cannot improve that without a principled arbitration policy. We would rather ship a reliable deterministic system and let the model earn its decisions through measurable lift than flip a switch and debug regressions in the wild.

What the ML is already good at today, even in shadow:

Technique identification from breathing alone. Given 20 seconds of audio, a lightly trained head distinguishes coherence from box from energy breathing with high confidence. This is what the Stimmungs-Check will use in the next phase.
Personal baseline drift detection. The same user's "relaxed" breath sounds different at 7 a.m. and after a long meeting. The embeddings space captures that; thresholds do not.

What comes next:

Phase 2b — duration correction for passive breathing (target: summer 2026). Our current detector under-estimates inhalation duration when users are truly relaxed (because passive inhalation is acoustically quiet). YAMNet embeddings carry that information even when the RMS doesn't. Closing this gap is the highest-priority ML integration.
Clinical validation study (target: autumn 2026, pending ethics approval). A prospective single-arm study, N ≈ 30, in Switzerland. Primary endpoint: balanced accuracy against a synchronized chest-belt reference. Secondary endpoint: per-cell Cohen's κ to qualify the label substrate for Phase 2.
On-device technique suggestion. Replace the hand-written recommendation rules in the Stimmungs-Check with a learned policy over the embedding space.
Sleep-breathing detection. A separate problem — quieter, longer windows, no attention from the user — but the pipeline is ready.

The architecture decision that unlocks all of this: the ML module doesn't replace the state machine — it informs it. The state machine is the fallback, the ground truth, the thing we can reason about. The ML module is a probabilistic sensor that the state machine can consult when the signal is ambiguous. That's the only integration model we're willing to ship on a biofeedback device that people trust with their nervous system.

The validation study we are now planning has two pre-specified aims. Primary: quantify per-phase detection accuracy against a synchronized chest-belt reference (Polar H10 + Vernier Go Direct Respiration Belt, 50 Hz) in three nested acoustic settings, including scripted kitchen-style noise. Secondary: quantify Cohen's κ between the rule-based labels and the chest-belt reference, per technique × setting cell. Cells that clear a pre-specified κ threshold (lower bound of 95% CI ≥ 0.85) are declared ML-ready and feed the M3 personalized-continual-learning study that follows. Cells that don't are excluded from ML training and become targeted engineering work.

Framed this way, the study does two things with one cohort. It produces the accuracy number the product needs. It also produces the qualified label dataset the next paper needs. That is the efficiency we were willing to wait six months of engineering to earn.

Lessons

For other Capacitor / mobile audio developers wrestling with real-world signal:

Thresholds model sound, not breath. Any detector that only reacts to instantaneous amplitude will eventually stall on a user who doesn't breathe like your test data. Build rhythm expectations as soon as you can.
Asymmetric thresholds are free performance. Schmitt-trigger hysteresis costs two constants and eliminates an entire category of phase flaps.
Synchronization anchors beat heuristics. A 3-second prep phase that explicitly measures the user beats any amount of "guess which phase is starting" logic.
Ratios are not enough to derive minimum cycle duration. Absolute physiological floors must be encoded per technique. Discovering this cost us a day.
Ship the deterministic system first. Let the model prove itself in shadow. Premature ML integration in a real-time feedback loop is an anti-pattern.

Open Source, Again

The native iOS audio plugin is still open: @shiihaa/capacitor-audio-analysis — MIT licensed.

The detector logic described here lives in the shii·haa app itself (closed source for now — it's the core of the product), but the architectural patterns are general. If you're building real-time breath or voice detection on a constrained device, the three techniques above (prep-phase anchor, valley-rhythm gate, valley-rescue) are a starting point that will save you weeks.

If you're working on something similar and stuck, I'm always happy to look at an oscillogram. Felix — felix@shiihaa.app.

"shii… haa."