Paper Note: VITS
Abstract
VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech [1].
VITS aims to improve the performance of end-to-end (single-stage) TTS models so that the quality of synthesized speech matches or exceeds that of two-stage systems. The paper was published at ICML 2021.
This note provides an explanation and summary of VITS.