ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech - arxiv.org

## Metadata
- Author: **arxiv.org**
- Full Title: ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech
- Category: #articles
- Tags: #ai
- URL: https://arxiv.org/abs/2207.06389
## Highlights
- Denoising diffusion probabilistic models (DDPMs) have recently achieved
leading performance in many generative tasks. However, the cost of their
inherently iterative sampling process hinders their deployment in text-to-speech.
Through a preliminary study on diffusion model parameterization, we find that
previous gradient-based TTS models require hundreds or thousands of iterations
to guarantee high sample quality, which poses a challenge for accelerating
sampling. In this work, we propose ProDiff, a progressive fast diffusion model
for high-quality text-to-speech. Unlike previous work estimating the gradient
for data density, ProDiff parameterizes the denoising model by directly
predicting clean data, which avoids noticeable quality degradation when
sampling is accelerated. To tackle the model convergence challenge with fewer
diffusion iterations, ProDiff reduces the data variance on the target side via
knowledge distillation.
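
The contrast between gradient (noise) prediction and direct clean-data prediction can be sketched with the standard DDPM forward process. This is a minimal illustration, not ProDiff's implementation: the linear beta schedule, the 1000-step count, the array shape standing in for a mel-spectrogram, and all function names are assumptions chosen for the example.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    # Assumed linear beta schedule; alpha_bar[t] is the cumulative product of (1 - beta).
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, noise):
    # Forward diffusion: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def x0_from_eps(x_t, t, alpha_bar, eps_hat):
    # Noise-prediction view: recover clean data from a noise estimate.
    # The division by sqrt(alpha_bar_t) amplifies estimation errors at noisy steps.
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])

def eps_from_x0(x_t, t, alpha_bar, x0_hat):
    # Clean-data-prediction view: recover the implied noise from an x0 estimate.
    return (x_t - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
x0 = rng.standard_normal((80, 16))      # hypothetical stand-in for a mel-spectrogram
noise = rng.standard_normal(x0.shape)
t = 500
x_t = q_sample(x0, t, alpha_bar, noise)

# With perfect estimates the two parameterizations are interchangeable.
x0_rec = x0_from_eps(x_t, t, alpha_bar, noise)
eps_rec = eps_from_x0(x_t, t, alpha_bar, x0)

# Error amplification factor when converting a noise estimate into clean data.
amplification = np.sqrt(1.0 - alpha_bar[t]) / np.sqrt(alpha_bar[t])
```

With exact estimates both views give the same result. The difference appears when the network's estimate is imperfect: an error in `eps_hat` is scaled by `amplification` when mapped back to clean data, and that factor grows as `alpha_bar[t]` shrinks. This is one plausible reading of why predicting clean data directly could hold up better when only a few sampling steps are used.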