Maximilian Ammann

Supervisor: Shuo Liu

* Separating a speech mixture into multiple speech segments that belong to different speakers
* The challenge lies in separating a mixture whose sources come from the same domain and from
unknown speakers

To train an RNN that produces similar embeddings for the same speaker, we minimize:

Pulls embeddings of the same class closer together

$
J(Y, V) = || VV^{\mkern-1.5mu\mathsf{T}} - YY^{\mkern-1.5mu\mathsf{T}} ||^2_F =
\sum_{\substack{i=0,j=0 \\ y_i = y_j}} \left( ||v_i - v_j||^2 - 1 \right)
+ \sum_{i=0,j=0} \left( v_i^{\mkern-1.5mu\mathsf{T}} v_j \right)^2
$

Pushes embeddings of the different classes apart

* $VV^{\mkern-1.5mu\mathsf{T}} \in \mathbb{R}^{p \times p}$ is the estimated affinity matrix
* $YY^{\mkern-1.5mu\mathsf{T}} \in \mathbb{R}^{p \times p}$ is the ideal affinity matrix

Since the matrix $VV^{\mkern-1.5mu\mathsf{T}}$ is expensive to compute and store, the objective
above is rewritten in a low-rank form:

$J(Y, V) = || V^{\mkern-1.5mu\mathsf{T}} V ||^2_F - 2 || V^{\mkern-1.5mu\mathsf{T}} Y||^2_F + ||
Y^{\mkern-1.5mu\mathsf{T}} Y||^2_F$
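The equivalence of the full and low-rank forms can be checked numerically. The following sketch uses random unit-norm embeddings and one-hot labels as stand-ins for the network output; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, k = 200, 20, 3                 # TF-bins, embedding dimension, speakers

V = rng.normal(size=(p, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-norm embeddings
Y = np.eye(k)[rng.integers(0, k, size=p)]      # one-hot speaker labels

# Naive form: builds the p x p affinity matrices explicitly.
J_full = np.linalg.norm(V @ V.T - Y @ Y.T, "fro") ** 2

# Low-rank form: only d x d, d x k and k x k products are needed.
J_low = (np.linalg.norm(V.T @ V, "fro") ** 2
         - 2 * np.linalg.norm(V.T @ Y, "fro") ** 2
         + np.linalg.norm(Y.T @ Y, "fro") ** 2)

assert np.isclose(J_full, J_low)
```

The low-rank form never materializes a $p \times p$ matrix, which matters because $p$ (the number of TF-bins) is much larger than $d$ and $k$.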

$k$-Means is used to cluster the learnt embedding vectors.

The loss function of $k$-means adapted to our task is

$\gamma=\sum_{i=0}^{p} \sum_{j=0}^{k} b_{ij} ||v_i - c_j||^2 = ||V-BM||_F^2$

where:

- $B \in \mathbb{R}^{p \times k}$ with $b_{ij}=\begin{cases} 1 & \text{if $v_i$ is estimated to belong to speaker $j$} \\\\ 0 & \text{otherwise} \end{cases}$,
- $v_i \in V$,
- $c_j$ is the center of the cluster for speaker $j$ and
- $M=(Y^{\mkern-1.5mu\mathsf{T}} Y)^{-1}Y^{\mkern-1.5mu\mathsf{T}} V$.
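As a sketch (with synthetic embeddings and labels standing in for the network output), one can verify that $M$ stacks the per-speaker mean embeddings row-wise and that the matrix form of $\gamma$ matches the double sum:

```python
import numpy as np

rng = np.random.default_rng(1)
p, d, k = 200, 20, 2

# Synthetic stand-ins: one-hot labels Y and embeddings V scattered around a
# per-speaker prototype (in the pipeline, V comes from the RNN).
labels = np.repeat(np.arange(k), p // k)
Y = np.eye(k)[labels]
V = Y @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(p, d))

# M = (Y^T Y)^{-1} Y^T V stacks the per-speaker mean embeddings row-wise.
M = np.linalg.inv(Y.T @ Y) @ Y.T @ V
means = np.stack([V[labels == j].mean(axis=0) for j in range(k)])
assert np.allclose(M, means)

# With B = Y (perfect clustering), the matrix form of the loss ...
gamma = np.linalg.norm(V - Y @ M, "fro") ** 2
# ... equals the double sum of squared distances to the cluster centers.
gamma_sum = sum(np.linalg.norm(V[i] - M[labels[i]]) ** 2 for i in range(p))
assert np.isclose(gamma, gamma_sum)
```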

If both the training objective $J(Y, V)$ and the $k$-means objective $\gamma$ are small, then
$VV^{\mkern-1.5mu\mathsf{T}} \approx YY^{\mkern-1.5mu\mathsf{T}}$, which leads to $B \approx Y$.

The matrix $B \in \mathbb{R}^{p \times k}$ from the clustering step can be interpreted as a binary mask:

$\mathrm{IBM}_j(t, f)=b_{t(\frac{M}{2}+1)+f,j}$

where the row index flattens the time-frequency pair $(t, f)$, with $\frac{M}{2}+1$ frequency bins per frame.

The spectrum of an individual speaker can be obtained by multiplying the spectrum of the mixture
with the $\mathrm{IBM}$:
$\tilde{S}_j(t,f)=\mathrm{IBM}_j(t,f) \cdot S(t,f)$

With the phase information of the original signal we reconstruct the speech signal $x_j(n)$ using
the inverse STFT and the overlap-add approach.
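A minimal sketch of the masking and reconstruction step, using SciPy's STFT/ISTFT. The white-noise `mixture` and the threshold-based mask are placeholders for a real speech mixture and the mask obtained from clustering:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
rng = np.random.default_rng(2)
mixture = rng.normal(size=2 * fs)        # placeholder for a speech mixture

# Complex STFT of the mixture: magnitudes feed the network, phase is reused.
f, t, S = stft(mixture, fs=fs, nperseg=256)

# Placeholder binary mask for speaker j (in the pipeline it is read off the
# clustering matrix B); here: keep bins where the mixture is strong.
ibm = (np.abs(S) > np.median(np.abs(S))).astype(float)

S_j = ibm * S                            # masked spectrum, phase included
_, x_j = istft(S_j, fs=fs, nperseg=256)  # inverse STFT with overlap-add

assert np.isfinite(x_j).all()
```

Because the mask is applied to the complex spectrum, the noisy phase of the mixture is carried over unchanged into the reconstruction, exactly as described above.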

**Data**
Three data sets are used to test the DC algorithm, each containing 30h of training and 5h of
evaluation data:
* TIMIT [54] (Texas Instruments + MIT) and WSJ0 [55] (Wall Street Journal), which contain professional audio recordings
* TEDLIUM [56], which contains recordings of TED talks with varying quality.

**Metrics:**
* ISR (Image to Spatial Distortion Ratio)
* SIR (Source to Interference Ratio)
* SAR (Source to Artifact Ratio)
* SDR (Source to Distortion Ratio) which combines the above three
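As an illustration of what these ratios measure, the sketch below computes a simplified SDR in which the entire residual counts as distortion. The full BSS Eval metrics additionally decompose the residual into spatial, interference, and artifact components to obtain ISR, SIR, and SAR; the function name and signals here are purely illustrative:

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified SDR in dB: the whole residual counts as distortion.
    (BSS Eval additionally decomposes the residual into spatial, interference
    and artifact components to obtain ISR, SIR and SAR.)"""
    residual = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))

t = np.linspace(0, 1, 8000)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.default_rng(3).normal(size=t.size)
print(f"SDR: {sdr_db(clean, noisy):.1f} dB")
```

In practice the reference implementation (e.g. the `mir_eval` package) should be used, since the decomposition also handles permutation ambiguity between estimated and reference sources.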

**⇒ Speaker count detection using Order Selection**

Main approaches:
* Determine the distance between the two centers
* Calculate Within Cluster Error (WCE)
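A sketch of the WCE-based decision on synthetic embeddings. The factor 0.5 is an illustrative threshold, not the one used in the original work; the idea is that splitting truly overlapping speech into two clusters reduces the WCE sharply, while splitting a single speaker does not:

```python
import numpy as np

def wce(V, k, iters=50, seed=0):
    """Within Cluster Error: the k-means objective after clustering V into k groups."""
    rng = np.random.default_rng(seed)
    centers = V[rng.choice(len(V), size=k, replace=False)]
    for _ in range(iters):
        assign = np.linalg.norm(V[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.stack([V[assign == j].mean(axis=0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
    return sum(np.linalg.norm(V[i] - centers[assign[i]]) ** 2 for i in range(len(V)))

rng = np.random.default_rng(4)
# Embeddings drawn from two well-separated modes (stand-in for two speakers).
V = np.concatenate([rng.normal(-3.0, 1.0, size=(100, 5)),
                    rng.normal(+3.0, 1.0, size=(100, 5))])

# Overlapping speech is assumed if splitting into two clusters cuts the WCE strongly.
overlap = wce(V, 2) < 0.5 * wce(V, 1)
print(overlap)
```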

* Trained an RNN to produce similar embeddings for TF-bins which belong to the same speaker
* Separation performance suffers from noise in the data sets but is speaker-independent
* Order Selection is exploited to detect the existence of overlapping speech
* The process of clustering the learned embeddings via k-means is visualized

The goal of $k$-means is to cluster a vector set $\mathcal{X} \subset \mathbb{R}^d$ of $n$ vectors, given
an integer $k$ which denotes the number of expected partitions. The result of this operation is a set
$\mathcal{C} \subset \mathbb{R}^d$ of $k$ centers. Like a neural network, this algorithm also minimizes a
loss function:
$\gamma=\sum_{x \in \mathcal{X}} \min_{c \in \mathcal{C}} ||x-c||^2$
The objective is to minimize, for each $x \in \mathcal{X}$, the distance to its nearest center.
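Lloyd's algorithm optimizes this objective by alternating an assignment step and a center-update step. The sketch below (with synthetic two-dimensional data) records $\gamma$ each round and checks that it never increases:

```python
import numpy as np

rng = np.random.default_rng(5)
# Two synthetic blobs in the plane.
X = rng.normal(size=(300, 2)) + rng.choice([-4.0, 4.0], size=(300, 1))
k = 2

C = X[rng.choice(len(X), size=k, replace=False)]    # initial centers
history = []
for _ in range(10):
    d2 = ((X[:, None] - C[None]) ** 2).sum(axis=2)  # squared distances, n x k
    assign = d2.argmin(axis=1)                      # nearest-center assignment
    history.append(d2.min(axis=1).sum())            # current value of gamma
    # Move each center to the mean of its assigned vectors.
    C = np.stack([X[assign == j].mean(axis=0) if np.any(assign == j) else C[j]
                  for j in range(k)])

# Each assignment/update pair can only decrease the objective.
assert all(a >= b - 1e-9 for a, b in zip(history, history[1:]))
```

Both steps are individually optimal for the fixed other half (nearest center for fixed centers, cluster mean for fixed assignment), which is why $\gamma$ decreases monotonically to a local optimum.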