Abstract

VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech [1].

VITS aims to improve the performance of end-to-end (single-stage) TTS models so that the quality of synthesized speech matches or exceeds that of two-stage systems. The paper was published at ICML 2021.

This note provides an explanation and summary of VITS.

Read more »

Abstract

Normalizing flows convert simple probability densities (e.g., a Gaussian distribution) into complex distributions. They are used in generative modeling, reinforcement learning, variational inference, and so on. Flow means that the data "flow" through a series of bijections (invertible mappings) into an appropriate representation space. Normalizing means that the resulting density in the representation space integrates to $1$, satisfying the definition of a probability density function.
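As a minimal sketch of the change-of-variables rule behind this (a toy one-dimensional affine bijection of my own choosing, not a flow from any particular paper): if $z \sim p_z$ and $x = f(z)$ with $f$ invertible, then $\log p_x(x) = \log p_z(f^{-1}(x)) + \log \left| \frac{d f^{-1}}{dx} \right|$, and the transformed density still integrates to $1$.

```python
import numpy as np
from scipy.stats import norm

# Toy affine bijection x = a * z + b (hypothetical single-layer "flow").
a, b = 2.0, 1.0

def inverse(x):
    return (x - b) / a

def log_prob_x(x):
    # Change of variables:
    # log p_x(x) = log p_z(f^{-1}(x)) + log |d f^{-1}/dx|
    z = inverse(x)
    log_det_inv = -np.log(np.abs(a))  # derivative of the inverse is 1/a
    return norm.logpdf(z) + log_det_inv

# "Normalizing": the transformed density still integrates to ~1.
xs = np.linspace(-20.0, 20.0, 20001)
print(np.trapz(np.exp(log_prob_x(xs)), xs))  # ~1.0
```

Real flows stack many such bijections (with learnable parameters) and sum the log-determinant terms across layers.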

Read more »

Abstract

Variational Auto-Encoders (VAEs) are a type of generative model that combines probabilistic graphical models and neural networks. They learn latent-variable representations of data and can generate new samples from them. Rather than directly maximizing the potentially intractable log-likelihood, VAEs maximize a variational lower bound (the ELBO) on it.
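A minimal sketch of a one-sample Monte Carlo ELBO estimate, assuming a Gaussian posterior $q(z|x)$, a standard-normal prior, and a Bernoulli decoder (the `decode` function and all shapes here are hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo(x, mu, log_var, decode):
    """One-sample estimate of ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps

    # Reconstruction term: log p(x|z) under a Bernoulli likelihood
    p = decode(z)
    recon = np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))

    # Analytic KL( N(mu, diag(exp(log_var))) || N(0, I) )
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

    return recon - kl

# Toy usage with a fixed stand-in "decoder":
x = rng.integers(0, 2, size=4).astype(float)
print(elbo(x, mu=np.zeros(2), log_var=np.zeros(2),
           decode=lambda z: np.full(4, 1.0 / (1.0 + np.exp(-np.sum(z))))))
```

In practice `mu`, `log_var`, and `decode` come from trained encoder/decoder networks, and the ELBO is maximized by gradient ascent on their parameters.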

Read more »

Abstract

Beamforming is a technique that enhances speech signal quality by spatially steering the reception or transmission of sound waves, and it plays a crucial role in speech signal processing. Traditional beamforming algorithms, such as Minimum Variance Distortionless Response (MVDR), as well as recent deep learning-based methods like ADL-MVDR, have been widely applied. This article introduces the principles, implementations, and trade-offs of the mainstream beamforming algorithms.
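As a narrowband sketch of the MVDR solution $w = \frac{R_n^{-1} d}{d^{H} R_n^{-1} d}$ (the array geometry, noise model, and variable names below are illustrative assumptions, not taken from the article):

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR weights for one frequency bin.

    noise_cov : (M, M) Hermitian noise spatial covariance
    steering  : (M,) steering vector toward the target direction
    """
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

# Toy example: 4-mic array with random noise, target at broadside.
m = 4
rng = np.random.default_rng(0)
n = rng.standard_normal((m, 1000)) + 1j * rng.standard_normal((m, 1000))
noise_cov = (n @ n.conj().T) / n.shape[1] + 1e-6 * np.eye(m)
d = np.ones(m, dtype=complex)   # broadside: equal phase at every mic

w = mvdr_weights(noise_cov, d)
print(np.abs(w.conj() @ d))     # distortionless constraint: ~1.0
```

The check at the end verifies the defining MVDR property: the target direction passes with unit gain while noise power is minimized.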

Read more »