Speech Separation

using Deep Clustering

Maximilian Ammann

Supervisor: Shuo Liu

Motivation for Speech Separation

Source: Star Trek: The Next Generation, 5x28: Déjà Vu

* Separating a speech mixture into multiple speech segments that belong to different speakers
* The challenge lies in separating a mixture whose sources come from the same domain and from unknown speakers

Framework

* Feature Extraction: Creates the input features for the neural network. The input mixture is transformed using the STFT
* RNN: Calculates a fixed-dimensional embedding vector for each time-frequency bin using an RNN
* Clustering: Determines the centers of the resulting clusters in the embedding space
* Creation of Mask: The mask is calculated from the label information retrieved from clustering
* Waveform Reconstruction: **Only the amplitude information of the STFT**
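As a rough illustration of the feature extraction step, the following sketch computes a magnitude STFT with NumPy. The frame length, hop size, Hann window, log compression and the toy two-tone mixture are assumptions made for this example; none of them is taken from the slides.

```python
import numpy as np

def stft_magnitude(x, frame_len=256, hop=64):
    """Magnitude STFT of a 1-D signal, shape (frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([x[t * hop : t * hop + frame_len] * window
                       for t in range(n_frames)])
    # rfft keeps only the non-negative frequencies of a real signal.
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy mixture: two sinusoids standing in for two speakers (illustrative only).
fs = 8000
n = np.arange(fs)
mixture = np.sin(2 * np.pi * 440 * n / fs) + 0.5 * np.sin(2 * np.pi * 1000 * n / fs)

S = stft_magnitude(mixture)
features = np.log10(S + 1e-8)  # log compression is an assumption of this sketch
p = S.size                     # number of time-frequency bins fed to the network
```

Each of the `p` time-frequency bins later receives one embedding vector, so `p` grows with both the signal length and the frame length.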

* $n_{recurrent}$ LSTM hidden layers with $n_{hidden}$ hidden units
* Single fully-connected layer with $\tanh$ as activation function
* Dropout is applied to the forward and backward layers of the RNN
* LSTMs are beneficial in comparison to GRUs (conducted experiment)

Training the RNN

In order to train an RNN which produces similar embeddings for the same speaker we minimize:

Pulls embeddings of the same class closer together

$ J(Y, V) = || VV^{\mkern-1.5mu\mathsf{T}} - YY^{\mkern-1.5mu\mathsf{T}} ||^2_F = \sum_{\substack{i=0,j=0 \\ y_i = y_j}} \left( | v_i - v_j |^2 - 1 \right) + \sum_{i=0,j=0} \left( v_i^{\mkern-1.5mu\mathsf{T}} v_j \right)^2 $

Pushes embeddings of different classes apart

* $VV^{\mkern-1.5mu\mathsf{T}} \in \mathbb{R}^{p \times p}$ is the estimated affinity matrix
* $YY^{\mkern-1.5mu\mathsf{T}} \in \mathbb{R}^{p \times p}$ is the ideal affinity matrix

Because the $p \times p$ matrix $VV^{\mkern-1.5mu\mathsf{T}}$ is expensive to store and compute, the objective is rewritten in an equivalent low-rank form:

$J(Y, V) = || V^{\mkern-1.5mu\mathsf{T}} V ||^2_F - 2 || V^{\mkern-1.5mu\mathsf{T}} Y||^2_F + || Y^{\mkern-1.5mu\mathsf{T}} Y||^2_F$
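The equivalence of the two forms can be checked numerically. The sketch below is a minimal NumPy example on random toy data; the sizes and the symbol `d` for the embedding dimension are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, k = 200, 20, 2  # TF bins, embedding dimension, speakers (toy sizes)

# Unit-norm embeddings V (p x d) and one-hot speaker labels Y (p x k).
V = rng.standard_normal((p, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
Y = np.eye(k)[rng.integers(0, k, size=p)]

def loss_full(V, Y):
    # Builds both p x p affinity matrices explicitly.
    return np.linalg.norm(V @ V.T - Y @ Y.T, "fro") ** 2

def loss_low_rank(V, Y):
    # Only d x d, d x k and k x k matrices are formed.
    return (np.linalg.norm(V.T @ V, "fro") ** 2
            - 2 * np.linalg.norm(V.T @ Y, "fro") ** 2
            + np.linalg.norm(Y.T @ Y, "fro") ** 2)

full = loss_full(V, Y)
low_rank = loss_low_rank(V, Y)
print(np.isclose(full, low_rank))
```

The low-rank version never allocates a $p \times p$ array, which matters because $p$ is the number of time-frequency bins of an utterance.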

Clustering

$k$-Means is used to cluster the learnt embedding vectors.

The loss function of $k$-means adapted to our task is

$\gamma=\sum_{i=0}^{p-1} \sum_{j=0}^{k-1} b_{ij} ||v_i - c_j||^2 = ||V-BM||_F^2$

where:

$B \in \mathbb{R}^{p \times k}$ with $b_{ij}=\begin{cases} 1 & \text{if $v_i$ is estimated to belong to speaker $j$} \\ 0 & \text{otherwise} \end{cases}$,
$v_i \in V$,
$c_j$ is the center of the cluster for speaker $j$ and
$M=(Y^{\mkern-1.5mu\mathsf{T}} Y)^{-1}Y^{\mkern-1.5mu\mathsf{T}} V$.

If the training objective $J(Y, V)$ and the $k$-means objective $\gamma$ are both small, then $VV^{\mkern-1.5mu\mathsf{T}} \approx YY^{\mkern-1.5mu\mathsf{T}}$, which leads to $B \approx Y$.
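The matrix form of the $k$-means objective can be verified on a small example. The sketch below assumes the ideal case $B = Y$, so that the rows of $M$ are the per-speaker means of the embeddings; the toy sizes and the random data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
p, d, k = 12, 3, 2  # toy sizes; d is an assumed name for the embedding dimension
V = rng.standard_normal((p, d))

# One-hot labels; arange % k guarantees that no cluster is empty.
labels = np.arange(p) % k
Y = np.eye(k)[labels]
B = Y.copy()  # assumption: the estimated assignment equals the ideal one

# Rows of M are the cluster centers c_j (mean embedding per speaker).
M = np.linalg.inv(Y.T @ Y) @ Y.T @ V

# Double-sum form and Frobenius-norm form of the objective.
gamma_sum = sum(B[i, j] * np.sum((V[i] - M[j]) ** 2)
                for i in range(p) for j in range(k))
gamma_matrix = np.linalg.norm(V - B @ M, "fro") ** 2
print(np.isclose(gamma_sum, gamma_matrix))
```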

Waveform Reconstruction

The matrix $B \in \mathbb{R}^{p \times k}$ from the clustering step can be interpreted as a binary mask:

$\mathrm{IBM}_j(t, f)=b_{t(\frac{M}{2}+1)+f,j}$

The spectrum of an individual speaker can be obtained by multiplying the spectrum of the mixture with the $\mathrm{IBM}$: $\tilde{S}_j(t,f)=\mathrm{IBM}_j(t,f) \cdot S(t,f)$

With the phase information of the original signal, we reconstruct the speech signal $x_j(n)$ using the inverse STFT and the overlap-add approach.
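The mask construction amounts to a row-major reshape of $B$. The following sketch assumes $F = \frac{M}{2}+1$ frequency bins per frame, which is what the index formula suggests, and uses a random complex matrix in place of a real mixture STFT. The inverse STFT step is not shown.

```python
import numpy as np

T, F, k = 4, 5, 2  # frames, frequency bins per frame (assumed F = M/2 + 1), speakers
rng = np.random.default_rng(2)

# Toy complex mixture spectrum S(t, f); real data would come from an STFT.
S = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))

# Toy cluster assignment: row t*F + f of B belongs to bin (t, f).
labels = rng.integers(0, k, size=T * F)
B = np.eye(k)[labels]  # shape (T*F, k)

# masks[j, t, f] == B[t*F + f, j], i.e. one binary mask per speaker.
masks = B.T.reshape(k, T, F)

# Masking keeps the mixture phase in every bin assigned to speaker j.
S_est = masks * S  # shape (k, T, F)
```

Because every bin is assigned to exactly one speaker, the masked spectra sum back to the mixture spectrum.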

Experiments

**Data:** Three data sets are used to test the DC algorithm, each containing 30 h of training and 5 h of evaluation data:
* TIMIT [54] (Texas Instruments + MIT) and WSJ0 [55] (Wall Street Journal), which contain professional audio recordings
* TEDLIUM [56], which contains recordings of TED talks with varying quality

**Metrics:**
* ISR (Image to Spatial Distortion Ratio)
* SIR (Source to Interference Ratio)
* SAR (Source to Artifact Ratio)
* SDR (Source to Distortion Ratio), which combines the above three

Hyperparameters

Experiments

Experiment results for adjusting the learning rate.

Performance A

Example inference on the WSJ0 data set with a male and a female speaker.

Performance B

Example inference on the TIMIT data set with two female speakers.

Performance C

Example inference on the noisy TEDLIUM data set with two male speakers.

Are there concurrently speaking people?

**⇒ Speaker count detection using Order Selection**

Main approaches:
* Determine the distance between the two centers
* Calculate the Within Cluster Error (WCE)

Order Selection

(a) Euclidean distance between two centers of the clusters for $k = 2$.

(b) The WCE of a speech signal for $k \in \{1,2\}$.
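Both criteria can be sketched on synthetic embeddings. The example below assumes that the WCE is the $k$-means loss (the sum of squared distances to the nearest center); the helper names, the farthest-point initialisation and the toy data are hypothetical and not taken from the slides.

```python
import numpy as np

def kmeans(V, k, n_iter=20):
    # Farthest-point initialisation (deterministic; an assumption of this sketch).
    centers = [V[0]]
    for _ in range(1, k):
        d2 = np.min([((V - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(V[d2.argmax()])
    centers = np.stack(centers)
    for _ in range(n_iter):
        d2 = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return centers, labels

def wce(V, centers):
    # Sum of squared distances of every embedding to its nearest center.
    d2 = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def toy_embeddings(directions, n_per=100, noise=0.05, seed=0):
    # Unit-norm points scattered around one direction per simulated speaker.
    rng = np.random.default_rng(seed)
    V = np.concatenate([u + noise * rng.standard_normal((n_per, len(u)))
                        for u in directions])
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def order_stats(V):
    c1, _ = kmeans(V, 1)
    c2, _ = kmeans(V, 2)
    return np.linalg.norm(c2[0] - c2[1]), wce(V, c1), wce(V, c2)

e = np.eye(4)
dist_single, wce1_single, wce2_single = order_stats(toy_embeddings([e[0]]))
dist_double, wce1_double, wce2_double = order_stats(toy_embeddings([e[0], e[1]]))
```

In this toy setting the center distance for $k = 2$ is small for a single speaker and large for two, and the WCE drops sharply from $k = 1$ to $k = 2$ only when two speakers are present. Any decision threshold would have to be tuned on real data.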

Visualisation of Clustering

Conclusion and Outlook

* Trained an RNN to produce similar embeddings for TF-bins which belong to the same speaker
* Separation performance suffers from noise in data sets but is speaker independent
* Order Selection is exploited to detect the existence of overlapping speech
* The process of clustering the learned embeddings via k-means is visualized

Backup Slides

K-Means

The goal of k-means is to cluster a set $\mathcal{X} \subset \mathbb{R}^d$ of $n$ vectors, given an integer $k$ that denotes the number of expected partitions. The result of this operation is a set $\mathcal{C} \subset \mathbb{R}^d$ of $k$ centers. Like neural network training, the algorithm optimizes a loss function: $\gamma=\sum_{x \in \mathcal{X}} \min_{c \in \mathcal{C}} ||x-c||^2$. The objective is to minimize, summed over all $x \in \mathcal{X}$, the squared distance to the nearest center.
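The loss can be evaluated directly for a given set of centers. The sketch below uses four hand-picked 2-D points and shows that one assignment and update step (Lloyd's algorithm, assumed here as the optimizer) lowers $\gamma$; the function names are hypothetical.

```python
import numpy as np

def kmeans_loss(X, C):
    """gamma: sum over all x of the squared distance to the nearest center."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
    return d2.min(axis=1).sum()

def lloyd_step(X, C):
    """Assign each point to its nearest center, then move centers to the means."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return np.stack([X[nearest == j].mean(axis=0) if np.any(nearest == j) else C[j]
                     for j in range(len(C))])

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
C0 = np.array([[0.0, 0.0], [1.0, 0.0]])  # deliberately poor initial centers

loss_before = kmeans_loss(X, C0)  # 20.0
C1 = lloyd_step(X, C0)            # centers move to (0, 0.5) and (4, 0.5)
loss_after = kmeans_loss(X, C1)   # 1.0
```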

Order Selection $k=1$

(a) The plot shows the Euclidean distance between the origin and the center for $k = 1$, as well as the distance between the two clusters for $k = 2$.

Evaluation results of the baseline model.

Evaluation results for adjusting the learning rate.

Evaluation results for adjusting the amount of hidden units.

Evaluation results for adjusting the batch size.

Evaluation results for adjusting the embedding dimension.

Evaluation results for adjusting the dropout.