Illustrative only — the spectra shown here are mathematically synthesized to demonstrate FFT concepts (window tiling, PMF normalization, centroid, rolloff). They are not measurements of any real or TTS utterance. The actual varṇa frequency distributions in the Sparśa Grid are computed from audio using identical parameters.
8000 vs 1024: 8000 is the number of samples in the full recording (0.5 s × 16 kHz). The FFT never sees all 8000 at once. It takes a contiguous 1024-sample slice, computes the spectrum, then slides forward 512 samples and repeats. 1024 = 2¹⁰ is chosen for FFT algorithm efficiency, not derived from 8000. The two numbers interact only through the frame count: ⌊(8000 − 1024) / 512⌋ + 1 = 14 frames.
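The framing arithmetic can be checked with a short sketch. The helper name `frame_starts` is hypothetical, and it assumes that a frame which does not fit entirely inside the recording is dropped rather than zero-padded, which is what the floor in the formula implies.

```python
def frame_starts(n_samples: int, frame_len: int = 1024, hop: int = 512) -> list[int]:
    """Start index of every full frame (hypothetical helper, no zero-padding)."""
    if n_samples < frame_len:
        return []
    n_frames = (n_samples - frame_len) // hop + 1
    return [i * hop for i in range(n_frames)]


starts = frame_starts(8000)
last_end = starts[-1] + 1024               # one past the last sample covered
print(len(starts), starts[-1], last_end)   # 14 6656 7680
```

Under this assumption the last frame ends at sample 7680, so the final 320 samples of the recording fall outside every frame.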
Each of the 513 bins (1024/2 + 1, spanning 0 Hz to the 8 kHz Nyquist frequency) gets a squared-magnitude value. The y-axis is in arbitrary energy units. The dynamic range is large: noise-floor bins are tiny slivers beside the formant peaks, which is why normalisation matters.
[Figure: power spectrum of a single frame, frequency axis 0–3000 Hz, with formant peaks labelled F1 ~700 Hz, F2 ~1400 Hz, F3 ~2600 Hz.]
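A minimal sketch of how one 1024-sample frame becomes 513 squared-magnitude values. It assumes pure Python with a textbook recursive radix-2 FFT (not necessarily the implementation used for the Sparśa Grid), no taper window, and a synthetic test tone placed exactly on bin 45 (about 703 Hz, near the F1 label). The names `fft` and `power_spectrum` are illustrative.

```python
import cmath
import math

SAMPLE_RATE = 16_000
FRAME_LEN = 1024  # must be a power of two for this radix-2 FFT


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum(frame):
    """Squared magnitude of bins 0..N/2 (513 values for N = 1024)."""
    spectrum = fft(frame)
    return [abs(c) ** 2 for c in spectrum[: len(frame) // 2 + 1]]


bin_hz = SAMPLE_RATE / FRAME_LEN  # 15.625 Hz per bin
tone_bin = 45                     # 45 * 15.625 = 703.125 Hz
frame = [math.sin(2 * math.pi * tone_bin * i / FRAME_LEN) for i in range(FRAME_LEN)]

power = power_spectrum(frame)
peak = max(range(len(power)), key=power.__getitem__)
print(len(power), peak, peak * bin_hz)  # 513 45 703.125
```

Because the tone sits exactly on a bin centre, essentially all of its energy lands in one bin. A tone between bins, or real speech, would spread across neighbouring bins, which is one reason a taper window is often applied before the FFT.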
Dividing each bin by the frame's total energy turns the spectrum into a discrete probability mass function (PMF) over frequency: each bin value p(n) is the fraction of total spectral energy in that bin for this frame, and all 513 bins sum to exactly 1. This is the frequency-domain sense in which "AUC = 1." It is not the same as a probability distribution of formant locations across time, which would be a histogram of F1/F2/F3 peak frequencies measured across frames.
[Figure: per-frame PMF p(n), frequency axis 0–3000 Hz, with a readout of Σ p(n) over the displayed 0–3 kHz bins.]
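The normalisation can be sketched as follows. The function names and the five-value toy spectrum are hypothetical stand-ins for a real 513-bin frame, and the 1000 Hz bin spacing is chosen only to keep the arithmetic readable.

```python
def to_pmf(power):
    """Divide each bin by the frame's total energy so the bins sum to 1."""
    total = sum(power)
    if total <= 0:
        raise ValueError("frame has no energy; PMF is undefined")
    return [p / total for p in power]


def mass_below(pmf, bin_hz, max_hz):
    """Sum of p(n) over bins whose centre frequency is <= max_hz."""
    return sum(p for k, p in enumerate(pmf) if k * bin_hz <= max_hz)


# Toy spectrum: bins centred at 0, 1000, 2000, 3000 and 4000 Hz.
power = [1.0, 4.0, 2.0, 1.0, 2.0]
pmf = to_pmf(power)

full_sum = sum(pmf)                                       # all bins
shown_sum = mass_below(pmf, bin_hz=1000.0, max_hz=3000.0)  # 0-3 kHz only
print(round(full_sum, 6), round(shown_sum, 6))            # 1.0 0.8
```

The full PMF sums to 1 by construction, while the sum over a displayed sub-range such as 0–3 kHz can only be less than or equal to 1, since any energy above the cutoff is left out.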
Average the per-frame PMFs bin by bin across all 14 frames. Because each per-frame PMF sums to 1, the mean p̄(n) also sums to 1. This is the whole-utterance frequency distribution. Centroid = the energy-weighted mean frequency (where the "centre of mass" of energy sits). Rolloff 95% = the frequency below which 95% of total energy lies.
[Figure: mean PMF p̄(n), frequency axis 0–3000 Hz, with readouts for centroid (Hz), rolloff 95% (Hz), and Σ p̄(n).]
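A small sketch of the averaging step and the two summary statistics, using two hand-made four-bin PMFs instead of 14 real frames. The function names and the 1000 Hz bin spacing are hypothetical. The rolloff here is taken as the lowest bin frequency at which the cumulative mass reaches 95%, which is one common convention and an assumption about how the displayed value is computed.

```python
def mean_pmf(pmfs):
    """Bin-by-bin average of per-frame PMFs (all frames the same length)."""
    n_frames = len(pmfs)
    return [sum(column) / n_frames for column in zip(*pmfs)]


def centroid_hz(pmf, bin_hz):
    """Probability-weighted mean frequency."""
    return sum(k * bin_hz * p for k, p in enumerate(pmf))


def rolloff_hz(pmf, bin_hz, fraction=0.95):
    """Lowest bin frequency at which cumulative mass reaches `fraction`."""
    cumulative = 0.0
    for k, p in enumerate(pmf):
        cumulative += p
        if cumulative >= fraction:
            return k * bin_hz
    return (len(pmf) - 1) * bin_hz  # guards against rounding shortfall


frames = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
]
bin_hz = 1000.0
p_bar = mean_pmf(frames)

print(p_bar)                       # [0.25, 0.5, 0.25, 0.0]
print(sum(p_bar))                  # 1.0
print(centroid_hz(p_bar, bin_hz))  # 1000.0
print(rolloff_hz(p_bar, bin_hz))   # 2000.0
```

Averaging normalised frames gives every frame equal weight regardless of loudness. Averaging raw power spectra first and normalising afterwards would instead let louder frames dominate the result.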