FFT Window Tiling — 0.5 s utterance

Illustrative only — the spectra shown here are mathematically synthesized to demonstrate FFT concepts (window tiling, PMF normalization, centroid, rolloff). They are not measurements of any real or TTS utterance. The actual varṇa frequency distributions in the Sparśa Grid are computed from audio using identical parameters.

8000 vs 1024 — 8000 is the full recording (0.5 s × 16 kHz). The FFT never sees all 8000 at once — it takes a contiguous 1024-sample slice, computes the spectrum, then slides forward 512 samples and repeats. 1024 = 2¹⁰ is chosen for FFT algorithm efficiency, not derived from 8000. The two numbers interact only through: ⌊(8000 − 1024) / 512⌋ + 1 = 14 frames.
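The frame arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the numbers stated in the text; the names `frame_starts`, `N_SAMPLES`, `WIN` and `HOP` are illustrative, and it assumes the tail of the recording is not zero-padded.

```python
# Frame layout for a 0.5 s utterance at 16 kHz (values taken from the text).
N_SAMPLES = 8000   # 0.5 s x 16 kHz
WIN = 1024         # FFT window length, 2**10
HOP = 512          # slide between consecutive windows

def frame_starts(n_samples, win, hop):
    """Return the start index of every full window that fits in the recording."""
    n_frames = (n_samples - win) // hop + 1
    return [i * hop for i in range(n_frames)]

starts = frame_starts(N_SAMPLES, WIN, HOP)
n_frames = len(starts)            # 14 frames
last_end = starts[-1] + WIN       # end (exclusive) of the last window
leftover = N_SAMPLES - last_end   # trailing samples that no full window covers
overlap = WIN - HOP               # samples shared by adjacent windows

print(n_frames, starts[-1], last_end, leftover, overlap)
# prints: 14 6656 7680 320 512
```

Under the no-padding assumption, the last 320 samples fall outside every window, which is why the floor appears in the frame-count formula.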

sample tape — 8000 samples · slider selects frame for per-frame panels below

Grey bar = all 8000 samples in time order. Blue bracket = the 1024-sample window for the selected frame — contiguous, sliding right by 512 each hop. Amber = 512 samples shared with the previous window. The power(n) and per-frame PMF panels respond to this slider. The mean PMF panel always shows all 14 frames.

frame W7

power(n) — raw FFT magnitude² · selected frame only

power(n) = |X[n]|²

Each of the 513 bins (1024/2 + 1, spaced 16000/1024 = 15.625 Hz apart from 0 to 8000 Hz) gets a squared-magnitude value. The y-axis is in arbitrary energy units. The dynamic range is large: noise-floor bins are tiny slivers beside the formant peaks, which is why normalisation matters.
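As a rough illustration of power(n) = |X[n]|², the sketch below computes the 513 one-sided bins for a single 1024-sample frame. It makes several assumptions: the hand-written radix-2 `fft` is there only to keep the example dependency-free (a library routine such as `numpy.fft.rfft` would normally be used), no taper is applied because the text does not name one, and the test frame is a hypothetical pure tone, not speech.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def power_spectrum(frame):
    """Return |X[n]|**2 for the non-negative-frequency bins 0 .. len(frame)//2."""
    spectrum = fft(frame)
    return [abs(c) ** 2 for c in spectrum[: len(frame) // 2 + 1]]

FS = 16000   # sample rate in Hz
WIN = 1024   # window length in samples

# Hypothetical frame: a sine centred exactly on bin 45 (45 * 15.625 Hz).
tone_bin = 45
frame = [math.sin(2 * math.pi * tone_bin * i / WIN) for i in range(WIN)]

power = power_spectrum(frame)
peak = max(range(len(power)), key=power.__getitem__)
bin_hz = FS / WIN   # 15.625 Hz per bin

print(len(power), peak, peak * bin_hz)
# prints: 513 45 703.125
```

A bin-centred tone puts essentially all of its energy in one bin; real speech frames spread energy across many bins, which is where the large dynamic range comes from.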

Plot annotations: F1 ~700 Hz · F2 ~1400 Hz · F3 ~2600 Hz

x-axis: frequency, 0 to 3000 Hz, ticks every 500 Hz

spectral PMF — per-frame · AUC = 1 · selected frame

p(n) = power(n) / Σ_k power(k), so that Σ_n p(n) = 1

This curve is a discrete probability mass function (PMF) over frequency: each bin value p(n) is the fraction of total spectral energy in that bin for this frame, and all 513 bins sum to 1. This is the frequency-domain sense in which "AUC = 1." Because the panel shows only 0–3 kHz, the sum over the displayed bins can be less than 1. The PMF is not the same as a probability distribution of formant locations across time; that would be a histogram of F1/F2/F3 peak frequencies measured across frames.
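The normalisation step is small enough to sketch directly. The function name `spectral_pmf` and the 6-bin power values are hypothetical, chosen only so the arithmetic is easy to follow. The zero-power guard is an added assumption about how a silent frame should be handled.

```python
def spectral_pmf(power):
    """Normalise a power spectrum so that its bins sum to 1."""
    total = sum(power)
    if total == 0:
        raise ValueError("silent frame: total power is zero, PMF is undefined")
    return [p / total for p in power]

# Hypothetical 6-bin power spectrum with one dominant peak (total = 50).
power = [0.5, 2.0, 40.0, 6.0, 1.0, 0.5]
pmf = spectral_pmf(power)

# Sum over a subset of bins, analogous to summing only the displayed range.
displayed = sum(pmf[:3])

print([round(p, 2) for p in pmf], round(displayed, 2))
# prints: [0.01, 0.04, 0.8, 0.12, 0.02, 0.01] 0.85
```

The partial sum shows why a readout restricted to the displayed range can fall short of 1 even though the full PMF sums to 1.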

x-axis: frequency, 0 to 3000 Hz, ticks every 500 Hz

Σ p(n) over displayed 0–3 kHz bins = —

mean spectral PMF — all 14 frames · AUC = 1

p̄(n) = (1/14) · Σ_f p_f(n), so that Σ_n p̄(n) = 1

Average the per-frame PMFs bin by bin across all 14 frames. Because each per-frame PMF sums to 1, the mean also sums to 1. This is the whole-utterance frequency distribution. Centroid = the PMF-weighted mean frequency, Σ_n f(n) · p̄(n), where f(n) is the centre frequency of bin n (the "centre of mass" of the energy). Rolloff 95% = the frequency below which 95% of the total energy lies.
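The three summary quantities can be sketched as follows. The helper names and the toy data (2 frames, 5 bins, 100 Hz per bin) are hypothetical; the real setting has 14 frames of 513 bins at 15.625 Hz per bin. The rolloff here uses one common convention, the lowest bin at which the cumulative PMF reaches the threshold. Other conventions exist, and the text does not say which one the page uses.

```python
def mean_pmf(pmfs):
    """Average per-frame PMFs bin by bin."""
    n_frames = len(pmfs)
    return [sum(column) / n_frames for column in zip(*pmfs)]

def centroid_hz(pmf, bin_hz):
    """PMF-weighted mean frequency: sum of f(n) * p(n)."""
    return sum(n * bin_hz * p for n, p in enumerate(pmf))

def rolloff_hz(pmf, bin_hz, fraction=0.95):
    """Lowest bin frequency at which the cumulative PMF reaches `fraction`."""
    cumulative = 0.0
    for n, p in enumerate(pmf):
        cumulative += p
        if cumulative >= fraction:
            return n * bin_hz
    return (len(pmf) - 1) * bin_hz

# Hypothetical toy data: each row is one frame's PMF and sums to 1.
pmfs = [
    [0.0, 0.50, 0.50, 0.0, 0.0],
    [0.0, 0.25, 0.25, 0.5, 0.0],
]
BIN_HZ = 100.0

avg = mean_pmf(pmfs)   # [0.0, 0.375, 0.375, 0.25, 0.0]
print(sum(avg), centroid_hz(avg, BIN_HZ), rolloff_hz(avg, BIN_HZ))
# prints: 1.0 187.5 300.0
```

Averaging PMFs gives every frame equal weight regardless of loudness; averaging raw power first and normalising afterwards would instead let louder frames dominate.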

x-axis: frequency, 0 to 3000 Hz, ticks every 500 Hz

centroid = — Hz · rolloff 95% = — Hz · Σ p̄(n) = —