Model Comparison: Voicebox, E2, and F5
Voicebox, E2, and F5 are three influential non-autoregressive TTS models that build on one another: E2 was based on Voicebox, and F5 in turn on E2. Here, I’ll compare the three models and explain how and why E2 and F5 diverged from their predecessors.
The papers:
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale (2023)
  - Code (unofficial): https://github.com/lucidrains/voicebox-pytorch
- E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS (2024)
  - Code (unofficial): https://github.com/lucidrains/e2-tts-pytorch
  - Code (unofficial, F5 authors’ reproduction): https://github.com/SWivid/F5-TTS, see `unett.py`
- F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching (2024)
  - Code (official): https://github.com/SWivid/F5-TTS, see `dit.py`
I’ll go over each relevant portion of the papers. When the papers’ notation for the same concept conflicts, I’ll default to F5’s notation.
0 Data: Audio as Mel Spectrograms & Text Transcripts
What data do the models use?
0.1 Speech Datasets
| Model | Dataset (hours) | Dataset Name |
|---|---|---|
| Voicebox | 60K (English), 50K (Multi) | Unknown |
| E2 | 50K (English) | Libriheavy[^1] |
| F5 | 95K (Multi) | Emilia[^2] |
Voicebox comes with two models, VB-En and VB-Multi, which were “trained on 60K hours of English audiobooks and 50K hours of multilingual audiobooks in 6 languages for the mono and multilingual setups.” I couldn’t find any further information on the datasets used, so they were probably proprietary. F5 is trained on Emilia, which contains 101K hours across 6 languages, but after filtering out misclassified data,[^3] the authors were left with “95K hours of English and Chinese data.”
Each paper also used smaller datasets for ablation studies and similar.
0.2 Mel Spectrograms
The three models all use log mel spectrograms[^4] to represent audio, for both the input and output of the models. They use off-the-shelf vocoders, sometimes with minor adaptations.
| | Voicebox | E2 | F5 |
|---|---|---|---|
| Vocoder model | HiFi-GAN V1 | BigVGAN | Vocos |
| Mel feature dim | 80 | 100 | 100 |
| Audio sampling rate | 16 kHz | 24 kHz | 24 kHz |
| Frame shift (hop length) | 160 samples (10 ms) | 256 samples (10.7 ms) | 256 samples (10.7 ms) |
| STFT size | 1024-point | 1024-point | 1024-point |
| STFT window size | 640 samples (40 ms) | 1024 samples (42.7 ms) | 1024 samples (42.7 ms) |
| Windowing function | Hann | Hann | Hann |
| Cutoff frequency | 8 kHz | 12 kHz | 12 kHz |
Information from the linked papers and codebases, or from the papers introducing the listed vocoders.
Voicebox uses lower-resolution audio than E2 and F5.
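To make the table concrete, here’s a minimal sketch of extracting log mel features with torchaudio at the E2/F5-style settings (24 kHz, 1024-point STFT, 256-sample hop, 100 mel bins). The exact scaling and normalization are vocoder-specific (BigVGAN and Vocos ship their own feature extraction), so treat this as illustrative rather than a drop-in match for any of the three models.

```python
import torch
import torchaudio

def log_mel_spectrogram(wav: torch.Tensor, sample_rate: int = 24_000) -> torch.Tensor:
    """Compute log mel features with E2/F5-style settings (illustrative only).

    wav: (channels, samples) waveform already resampled to `sample_rate`.
    Returns: (n_mels, frames) log mel spectrogram.
    """
    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=1024,                    # 1024-point STFT
        win_length=1024,               # 42.7 ms window at 24 kHz
        hop_length=256,                # 10.7 ms frame shift
        f_min=0.0,
        f_max=12_000.0,                # cutoff frequency
        n_mels=100,                    # mel feature dim (Voicebox: 80 bins at 16 kHz)
        window_fn=torch.hann_window,   # Hann windowing
        power=1.0,                     # magnitude spectrogram
    )
    mel = mel_transform(wav.mean(dim=0))           # downmix to mono: (n_mels, frames)
    return torch.log(torch.clamp(mel, min=1e-5))   # log compression with a small floor
```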
0.3 Text Representation
All papers train on audio samples that come with text transcripts, forming (audio, text) data pairs $(x,y)$. Training involves (audio, text) inputs $(x_1, z_{emb})$, where $x_1$ represents the mel features and $z_{emb}$ represents the text embedding. In each paper, the text $y$ is processed into a different character sequence $z$, then embedded into $z_{emb}$, before being combined with the audio input and passed into the main part of the model. (A sketch of the two padding schemes follows the lists below.)
Voicebox:
- $y$ is represented as a phone sequence $y = (y^1, y^2, …, y^M)$
- Duration model (see 1.3 and 3.2) is used to predict phone durations $l = (l^1, l^2, … , l^M)$
- Each $y^i$ is repeated $l^i$ times to get the frame-level phone transcript $z = (z^1, z^2, …, z^N)$, where $N$ is the mel sequence length
- $z$ is projected into $z_{emb}$
- Embedding dimension of $z_{emb}$, termed $H$, is not given[^5]
E2:
- $y$ is represented as a character sequence $y = (c_1, c_2, …, c_M)$
- Given character count $M$ and mel sequence length $N$, pad $y$ with $(N-M)$ filler tokens to get $z = (c_1, c_2, …, c_M, \langle F \rangle, \ldots, \langle F \rangle)$
- $z$ is projected into $z_{emb}$
- Embedding dimension of $z_{emb}$, termed $E$, is not given[^6]
F5:
- Like E2, $y$ is represented as $y = (c_1, c_2, …, c_M)$ and padded to $z = (c_1, c_2, …, c_M, \langle F \rangle, \ldots, \langle F \rangle)$
- As described in 3.3, the text is embedded and passed through four ConvNeXt V2 layers to get the final $z_{emb}$
- Embedding dimension of $z_{emb}$ is 512
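Here’s a minimal sketch of the two padding schemes above, assuming toy token IDs; the tokenizers, the filler token’s ID, and the embedding sizes (other than F5’s 512) are not fully specified by the papers, so those details are placeholders.

```python
import torch
import torch.nn as nn

FILLER_ID = 0  # hypothetical ID for the <F> filler token

def voicebox_frame_level_phones(phone_ids: list[int], durations: list[int]) -> torch.Tensor:
    """Voicebox: repeat each phone y^i for l^i frames to get the frame-level z of length N."""
    assert len(phone_ids) == len(durations)
    return torch.tensor([p for p, d in zip(phone_ids, durations) for _ in range(d)])

def e2_f5_padded_chars(char_ids: list[int], num_frames: int) -> torch.Tensor:
    """E2/F5: pad the character sequence with <F> fillers up to the mel length N."""
    assert len(char_ids) <= num_frames
    return torch.tensor(char_ids + [FILLER_ID] * (num_frames - len(char_ids)))

# Either sequence is then embedded before being stacked with the audio features.
embed = nn.Embedding(num_embeddings=256, embedding_dim=512)  # vocab size arbitrary; 512 matches F5
z_emb = embed(e2_f5_padded_chars([5, 12, 7, 9], num_frames=10))  # (N, 512)
```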
1 Training Tasks: Infilling & Unconditional Generation
What tasks are the models trained to do?
1.1 Infilling
All models are primarily trained to perform infilling: predicting a missing speech segment, given its surrounding audio and the full text transcript.
- Given a data pair $(x, y)$ with audio sample $x$ and transcript $y$
- Get mel spectrogram features $x_1 \in \mathbb{R}^{F \times N}$ with mel dimension $F$ and sequence length $N$
- Get text embedding $z_{emb} \in \mathbb{R}^{E \times N}$ with embedding dimension $E$, as described in 0.3
- Create a binary temporal mask $m \in \{0, 1\}^{F \times N}$
- $(1 − m) \odot x_1$ is the audio context, provided as input
- $m \odot x_1$ is the masked audio to predict
- A continuous segment spanning 70-100% of mel frames is masked.[^7]
- Using conditional flow matching, train the model to learn the distribution $P(m \odot x_1 \mid (1 - m) \odot x_1, z)$
- At each flow step $t \in [0, 1]$, we have noisy speech $(1 - t)x_0 + tx_1 \in \mathbb{R}^{F \times N}$, where $x_0$ is Gaussian noise
- Stack masked audio $(1 − m) \odot x_1$, noisy speech $(1 - t)x_0 + tx_1$, and text embedding $z_{emb} \in \mathbb{R}^{E \times N}$ to get input $\in \mathbb{R}^{(2F + E) \times N}$, then project it to $\mathbb{R}^{D \times N}$ for model dimension $D$
- Voicebox and E2 append a time (flow step) embedding $\hat{t} \in \mathbb{R}^D$ to the input tensor to get shape $\mathbb{R}^{D \times (N+1)}$
- F5 doesn’t do this because its time embeddings are passed into each transformer layer through adaLN-zero instead (see 3.1)
- For details on the loss, see 2. A sketch of assembling this training input follows this list.
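Here’s a minimal sketch of assembling one infilling training example as described above. Shapes follow this section’s notation, but the mask sampling and the projection layer are simplified stand-ins rather than any paper’s exact code.

```python
import torch
import torch.nn as nn

F, E, D = 100, 512, 1024  # mel dim, text embedding dim, model dim (F5-like values)

def make_infilling_input(x1: torch.Tensor, z_emb: torch.Tensor, proj: nn.Linear):
    """x1: (F, N) mel features; z_emb: (E, N) text embedding; proj: Linear(2F + E, D)."""
    N = x1.shape[1]

    # Continuous temporal mask covering 70-100% of the frames (1 = masked, to predict).
    span = int(N * torch.empty(1).uniform_(0.7, 1.0).item())
    start = torch.randint(0, N - span + 1, (1,)).item()
    m = torch.zeros(F, N)
    m[:, start:start + span] = 1.0

    t = torch.rand(1)              # flow step t ~ U[0, 1]
    x0 = torch.randn_like(x1)      # Gaussian noise
    noisy = (1 - t) * x0 + t * x1  # noisy speech at flow step t

    audio_context = (1 - m) * x1   # the surrounding audio the model may look at
    stacked = torch.cat([audio_context, noisy, z_emb], dim=0)  # (2F + E, N)
    model_input = proj(stacked.T).T  # (D, N); Voicebox/E2 then append a time embedding
    return model_input, m, t, x0

proj = nn.Linear(2 * F + E, D)
x1, z_emb = torch.randn(F, 200), torch.randn(E, 200)
model_input, m, t, x0 = make_infilling_input(x1, z_emb, proj)
```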
1.2 Unconditional Generation (Classifier-Free Guidance)
In addition to infilling a masked portion given audio and text context, the model is sometimes instead trained without audio and/or text context. This technique, classifier-free guidance (CFG), improves diversity and lets the user trade off between diversity and fidelity at inference time by adjusting the guidance strength.
Every model dropped all context (audio and text) with 20% probability (Voicebox for both the audio and duration models). Due to an ambiguous statement in the Voicebox paper,[^8] the F5 paper additionally dropped the audio context alone with an independent 30% probability, so the overall chance that the audio context was dropped was $1 - (1 - 0.3)(1 - 0.2) = 44\%$. The F5 authors were under the impression that Voicebox and E2 did the same. I’m not sure they’re right, since neither Voicebox nor E2 has officially released code.
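A minimal sketch of the condition dropping as the F5 authors read it (drop everything with probability 0.2, and independently drop only the audio context with probability 0.3); whether Voicebox and E2 really used the extra audio-only drop is, as noted, unclear.

```python
import random

P_DROP_ALL = 0.2    # drop both the audio context and the text
P_DROP_AUDIO = 0.3  # F5's reading of Voicebox's p_drop = 0.3: drop the audio context only

def drop_conditions(audio_context, text_emb):
    """Randomly drop conditioning during training so CFG can be applied at inference."""
    if random.random() < P_DROP_ALL:
        return None, None             # fully unconditional training sample
    if random.random() < P_DROP_AUDIO:
        return None, text_emb         # text-only sample
    return audio_context, text_emb    # fully conditional sample

# Overall, the audio context is dropped with probability 1 - (0.8)(0.7) = 0.44.
```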
1.3 Voicebox: Duration Prediction
One trait of non-autoregressive TTS models, as opposed to autoregressive ones, is that you must specify a duration up front in order to sample. People talk at different speeds, so how long a sentence should take to speak is not obvious.
Voicebox alone trains a separate model to predict duration. In contrast, E2 simply lets the user specify a duration, and F5 assumes the duration is proportional to character count. The default duration model used by Voicebox is a simple $L_1$ regression model that predicts the length of each phone in a text.
Voicebox also tested two other duration models: a flow-matching-based model similar to their audio model, and an unconditional model. Details are in Appendix C.2 of the Voicebox paper.
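To show how little machinery F5 needs here, a sketch of a character-count-proportional duration estimate; scaling by the reference utterance’s speaking rate is my assumption of a plausible implementation, not the paper’s exact formula.

```python
def estimate_total_frames(ref_frames: int, ref_text: str, gen_text: str) -> int:
    """Estimate how many mel frames to generate, assuming duration is proportional
    to character count at the reference utterance's speaking rate (an assumption)."""
    frames_per_char = ref_frames / len(ref_text)
    gen_frames = int(len(gen_text) * frames_per_char)
    return ref_frames + gen_frames  # total length: reference part + generated part

# Voicebox's default duration model instead regresses per-phone lengths with an L1 loss:
#   loss = |predicted_durations - true_durations|, summed over the phones to predict.
```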
2 Training Objective: Conditional Flow Matching with Optimal Transport
All three papers use conditional flow matching with an optimal-transport path. Flow matching is a generative framework similar to diffusion.
At each flow step $t$, the model $v_t$ outputs a mel-spectrogram-shaped vector field that points from the noisy input toward $x_1$. The loss is:
\[\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, q(x_1), p(x_0)} \left\| v_t\left((1 - t)x_0 + t x_1\right) - (x_1 - x_0) \right\|^2\]
where $p(x_0)$ is the noise distribution at the start of the conditional probability path (the time-dependent density transporting $x_0$ to $x_1$) and $q(x_1)$ is the distribution of $x_1$ (with $p_1 \approx q$). The flow matching is called conditional because $p$, $q$, etc. are conditioned on the specific data sample $x_1$.
$(1 - t)x_0 + t x_1$ corresponds to the noisy speech at $t$ (the other inputs are omitted here), and $x_1 - x_0$ is the path from the noise to the desired output. The exact formulation of these terms is what makes this an optimal transport path.[^9]
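The loss above translates almost directly into code. A minimal sketch of one training step, assuming a `model(noisy, t, audio_context, text_emb)` callable (a placeholder signature) that returns a vector field with the same shape as the mel features; restricting the loss to the masked frames follows the infilling setup in 1.1.

```python
import torch
import torch.nn.functional as F_nn

def cfm_loss(model, x1, m, audio_context, text_emb):
    """Conditional flow matching loss with the optimal-transport path.

    x1: (F, N) target mel features; m: (F, N) binary mask (1 = region to predict).
    """
    t = torch.rand(())                 # flow step t ~ U[0, 1]
    x0 = torch.randn_like(x1)          # Gaussian noise sample
    xt = (1 - t) * x0 + t * x1         # point on the OT path at time t
    target = x1 - x0                   # the OT path's (constant) velocity

    v = model(xt, t, audio_context, text_emb)  # predicted vector field, shape (F, N)
    # Only the masked (to-be-infilled) frames contribute to the loss.
    return F_nn.mse_loss(v * m, target * m)
```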
During inference, the vector fields output by the model at different flow steps are used along with an ODE solver to estimate $\phi_1(x_0)$. The flow $\phi$, which satisfies $d\phi_t (x_0)/dt = v_t(\phi_t(x_0), x_\text{ref}, z_{\text{ref} \cdot \text{gen}})$, is a function that gives the generated audio $x_\text{ref} \cdot x_\text{gen}$ from noise, using the reference $x_\text{ref}, z_\text{ref}$ and desired text $z_\text{gen}$. See more about inference in 4.
3 Model Architecture: U-Net Transformer vs. DiT
3.1 Audio Generation Architecture
All models generate audio with a tall stack of transformer layers totaling roughly 330M parameters.
| | Voicebox (Audio Model) | E2 | F5 |
|---|---|---|---|
| Parameters | 330M | 335M | 335.8M |
| Layers | 24 | 24 | 22 |
| Attention Heads | 16 | 16 | 16 |
| Transformer Backbone | w/ U-Net skip connections | w/ U-Net skip connections | DiT w/ adaLN-zero |
| Embedding Dimension | 1024 | 1024 | 1024 |
| FFN Dimension | 4096 | 4096 | 2048 |
| Flow Step Pos. Embedding | Sinusoidal, concat. with input | Sinusoidal, concat. with input | Sinusoidal, fed into adaLN every layer |
| Input Pos. Embedding | Convolutional | Convolutional | Convolutional |
| Attention Pos. Info | Bidirectional ALiBi | Bidirectional ALiBi…? | Rotary position embeddings |
| Text Embedding Convolutions | - | - | 4 ConvNeXt V2 layers |
E2 uses the same architecture as Voicebox’s audio model. F5 uses a very similar architecture with two key changes:
- **Adding convolutional layers for the text embedding.** The F5 authors aimed to fix robustness issues in E2. E2 stacked the character embedding sequence directly with the extracted mel features (the audio input), using only a linear layer to project this to the transformer input dimension, and the F5 authors claimed this causes a “deep entanglement of semantic and acoustic features” that confused the model. F5 instead feeds the text embeddings through four ConvNeXt V2 layers first. Intuitively, these layers give the model an opportunity to learn how to translate between text and audio.
- **Replacing the transformer backbone with DiT.** F5 stopped using the U-Net-style skip connections (which connect the first and last layers, the second and second-to-last, etc.) and switched to a diffusion transformer (DiT), which uses adaLN-zero (adaptive layer norm, initialized to zero). This introduces additional gating, which you can see in Figure 1 of the F5 paper (the purple “timestep” input into each transformer block and its associated scales/shifts). The authors don’t offer much explicit justification for this change, but they do experiment with model architecture variations in section 5.1. (A sketch of an adaLN-zero block follows this list.)
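For intuition, here’s a minimal sketch of an adaLN-zero transformer block in the DiT style: the flow-step embedding is mapped to per-block shifts, scales, and gates, and the final projection is zero-initialized so every block starts out as an identity mapping. This is a generic illustration, not F5’s dit.py.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Generic DiT-style block: attention + FFN, modulated by the flow-step embedding."""

    def __init__(self, dim: int = 1024, heads: int = 16, ff_mult: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ff = nn.Sequential(nn.Linear(dim, ff_mult * dim), nn.GELU(),
                                nn.Linear(ff_mult * dim, dim))
        # Six modulation vectors: shift/scale/gate for attention and for the FFN.
        self.to_mod = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.to_mod.weight)  # the "zero" in adaLN-zero: gates start at 0,
        nn.init.zeros_(self.to_mod.bias)    # so each block initially passes x through unchanged

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        """x: (batch, N, dim) sequence; t_emb: (batch, dim) flow-step embedding."""
        shift1, scale1, gate1, shift2, scale2, gate2 = self.to_mod(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.ff(h)
```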
The files `dit.py` and `unett.py` in the official F5 repository, which includes an E2 replication, are useful to compare. F5 is a `DiT` with parameters `dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4`, and E2 is a `UNetT` with parameters `dim=1024, depth=24, heads=16, ff_mult=4`.
F5 also has fewer layers and a smaller FFN dimension, likely chosen to keep the total parameter count comparable. It uses rotary position embeddings rather than a bidirectional ALiBi self-attention bias to capture positional information for attention, a minor change probably made for convenience.[^10]
3.2 Voicebox: Duration Model Architecture
Voicebox’s duration model is a smaller stack of transformers, without the UNet skip connections. It was replicated by E2 for comparison.
| | Voicebox (Duration Model, English) | Voicebox (Duration Model, Multilingual) |
|---|---|---|
| Parameters | 28M | 34M |
| Layers | 8 | 10 |
| Attention Heads | 8 | 8 |
| Embedding Dimension | 512 | 512 |
| FFN Dimension | 2048 | 2048 |
4 Inference: Flow $\phi$ and Sway Sampling
Inference for all three models is very similar. The model takes an audio reference $x_\text{ref}$ with its transcript $z_\text{ref}$, plus a concatenated reference-plus-desired-output transcript $z_{\text{ref} \cdot \text{gen}}$ (Voicebox needs to use its duration model to get the phone durations for $z$), then runs inference using the flow $\phi$.
For each flow step $t_i \in [0, 1]$:
- With model $v_t$, evaluate $d\phi_{t_i} (x_0)/dt = v_t(\phi_{t_i}(x_0), x_\text{ref}, z_{\text{ref} \cdot \text{gen}})$
- $\phi_{t_i}(x_0)$ is “the noisy speech” calculated from the previous flow step
- $\phi_{0}(x_0) = x_0$
- The vector field can be modified using CFG for increased diversity; this requires evaluating $v_t$ twice, once with and once without the condition
- Use an ODE solver to estimate $\phi_{t_{i+1}}$
- Voicebox and E2 used the midpoint method; F5 used Euler
Repeat until you get the desired output $\phi_{1}$. Crop out the $x_\text{ref}$ portion, then feed the result through a vocoder to get the final audio. Each model uses a different vocoder; see 0.2.
The number of flow steps $t_i$ is arbitrary, but the papers defaulted to 32; more steps mean higher quality. Voicebox and E2 evenly space the $t_i$, whereas F5 uses sway sampling to bias steps towards 0. Intuitively, the beginning of the flow is the “hardest” part, so the model is given more steps to estimate it.
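Putting this together, a minimal sketch of the sampling loop with an Euler solver and CFG. The sway sampling warp used here ($u \mapsto u + s(\cos(\pi u / 2) - 1 + u)$ with $s < 0$ pushing steps toward 0) is my reading of F5 and should be treated as an assumption, and `model(...)` is a placeholder for the trained vector-field network.

```python
import math
import torch

def sample(model, x_ref_context, z_emb, num_frames, steps=32, cfg_strength=2.0, sway=-1.0):
    """Euler ODE sampling from noise to mel features (illustrative, not any repo's exact code)."""
    n_mels = x_ref_context.shape[0]
    x = torch.randn(n_mels, num_frames)  # phi_0(x_0) = x_0

    # Flow-step schedule: uniform for Voicebox/E2; warped toward 0 for F5 (assumed formula).
    u = torch.linspace(0, 1, steps + 1)
    ts = u + sway * (torch.cos(math.pi / 2 * u) - 1 + u) if sway else u

    for t, t_next in zip(ts[:-1], ts[1:]):
        v_cond = model(x, t, x_ref_context, z_emb)  # conditional vector field
        v_uncond = model(x, t, None, None)          # unconditional (context dropped)
        v = v_uncond + cfg_strength * (v_cond - v_uncond)  # classifier-free guidance
        x = x + (t_next - t) * v                    # one Euler step toward phi_1
    return x  # crop off the reference portion, then vocode to get audio
```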
[^1]: https://arxiv.org/abs/2309.08105

[^2]: https://arxiv.org/abs/2407.05361

[^3]: Embarrassing for Emilia if true. So much for 6 languages.

[^4]: A common way to efficiently represent audio waveforms. Involves a filterbank that extracts features. Mel spectrograms must be converted back into a waveform via a vocoder.

[^5]: I can’t find it specified in the paper, anyways. lucidrains uses `dim_cond_emb=1024` for $H$.

[^6]: They mention a “character embedding vocabulary size” of 399, which seems to be the number of unique characters, not the embedding dimension. The F5 authors’ replication of E2 defaults to 100.

[^7]: The implementation is identical for all papers, with a complication. The Voicebox paper has both an audio model (the part that actually generates speech) and a duration model to train, which is explained in 1.2. Voicebox contains this weird line: “The audio/duration sequence is masked with $p_{\text{drop}} = 0.3/0.2$, and otherwise a segment of r% sequence length is masked, where $r \sim \mathcal{U}[70, 100]/\mathcal{U}[10, 100]$”, suggesting they do something else (involving $p_{\text{drop}}$) in addition to the 70-100% masking.
    In lucidrains’ implementation of Voicebox training, the duration sequence has a default 50% chance of having 20% ($p_{\text{drop}} = 0.2$) of the sequence individually masked, and a 50% chance of having a continuous segment spanning 10-100% masked. The corresponding $p_{\text{drop}} = 0.3$ for the audio model shows up in the code but is never used (it wouldn’t make sense to individually mask audio mel frames in the same way duration frames were, anyway).
    The F5 authors instead take $p_{\text{drop}} = 0.3$ to refer to the probability that the audio input is masked entirely, which they interpret as a version of classifier-free guidance. See `cfm.py`.

[^8]: The $p_{\text{drop}} = 0.3/0.2$ thing.

[^9]: The justification goes something like, “this is the simplest way to go from $x_0$ to $x_1$”, as opposed to a diffusion path, which involves something more complicated in your $\mathbb{E}$. Voicebox tested OT against diffusion paths and found that OT is better. The name amuses me, though. Brb, developing this new thing I’m calling Superior S6 (S7) with Infinitely Awesome Objective.

[^10]: Bidirectional ALiBi was invented by the Voicebox authors, and I’m not aware of a public implementation of it. The unofficial Voicebox implementation uses rotary embeddings, complaining that bidirectional ALiBi is nontrivial to reproduce. The E2 paper does not clarify whether it uses bidirectional ALiBi or rotary embeddings, and it has no official code.