Both the training and testing sets are segmented at word boundaries.
Only the first 13 dimensions of each MFCC vector are used.
The dimensionality of the representation is set to k = 100.
Zero-masking is used to generate the noisy input sequences for DSA.
A total of 5557 queries were made on the testing set.
Evaluation metric: Mean Average Precision (MAP)
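As a reminder of how MAP is computed, the sketch below averages the per-query average precision over all queries. It is a minimal illustration, not the evaluation script behind these results. The function names and the input format (one list of relevance flags per query, ordered by rank) are assumptions, and each query's average precision is normalized by the number of relevant items found in its ranked list, which assumes every relevant item is retrieved.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: list of booleans ordered from the top-ranked
    retrieved item to the lowest; True marks a relevant item.
    Assumption: every relevant item appears somewhere in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_relevance):
    """Mean of the per-query average precisions."""
    if not all_ranked_relevance:
        return 0.0
    total = sum(average_precision(r) for r in all_ranked_relevance)
    return total / len(all_ranked_relevance)
```

For example, a query whose ranked results are relevant, irrelevant, relevant has average precision (1/1 + 2/3) / 2.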
Phoneme sequence edit distance: the minimum number of operations required to transform one phoneme sequence into another.
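The edit distance above can be sketched with the standard dynamic-programming (Levenshtein) recurrence. The slides do not say which operations are allowed or how they are weighted, so this sketch assumes insertion, deletion, and substitution, each with unit cost. The function name and the phoneme symbols in the usage note are illustrative only.

```python
def edit_distance(source, target):
    """Levenshtein distance between two sequences (e.g. lists of phonemes).

    Assumption: insertion, deletion and substitution each cost 1.
    """
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j items of `target`.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]  # turning i source items into an empty target
        for j, t in enumerate(target, start=1):
            substitution = previous[j - 1] + (s != t)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

Two words that differ in a single phoneme, such as `["k", "ae", "t"]` and `["b", "ae", "t"]`, are at distance 1 under these assumptions.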
Observations
Two segments for words with larger phoneme sequence edit distances have clearly smaller cosine similarity on average.
SA and DSA can clearly distinguish even word segments that differ in only one phoneme.
The cosine similarity is fairly low even when the phoneme sequences of two audio segments are exactly the same, because the same phoneme sequence can have very different acoustic realizations.
DSA outperforms SA.
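The similarity measure used in these observations is the cosine similarity between two fixed-length vectors. A minimal sketch follows; the function name is an assumption, and rejecting zero vectors is a choice made here, not something the slides specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Choice made for this sketch: the measure is undefined here.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```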
Observations
One concern is that the last few acoustic features might dominate the vector representation, making words with the same suffix hard to distinguish by the learned representations.
However, the results show that although SA and DSA read the acoustic signal sequentially, words with the same suffix remain clearly distinguishable.
Experiments
2 baselines
Frame-based DTW
The vanilla version of DTW is adopted.
The frame-level distance is the Euclidean distance.
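A minimal sketch of frame-based DTW with Euclidean frame-level distance is given below. The slides say only that the vanilla version was used, so the details here are assumptions: the three-way step pattern (match, insertion, deletion), no warping window, and no path-length normalization. The function name is illustrative.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature frames.

    seq_a, seq_b: non-empty lists of equal-dimension frames
    (e.g. 13-dim MFCC vectors). Assumptions: symmetric three-way step
    pattern, no warping window, no length normalization.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning the first i frames of
    # seq_a with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_dist = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean
            cost[i][j] = frame_dist + min(
                cost[i - 1][j],      # advance in seq_a only
                cost[i][j - 1],      # advance in seq_b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]
```

Unlike the encoder-based approaches, this compares two variable-length segments directly and costs time proportional to the product of their lengths for every pair.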
Naive Encoder (NE)
Divide the input sequence into m segments, each of length $\frac{T}{m}$.
Average the vectors in each segment into a single 13-dim vector.
Concatenate all m average vectors to form one $(13 \times m)$-dim vector.
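The three NE steps (split into m segments, average each segment, concatenate the averages) can be sketched as follows. The slides do not say how segment boundaries are chosen when the sequence length T is not divisible by m; rounding the boundaries down is an assumption of this sketch, as are the function name and the requirement that T is at least m.

```python
def naive_encoder(frames, m):
    """Map a variable-length frame sequence to a fixed (dim * m) vector.

    frames: list of T feature vectors (13-dim MFCCs in the slides).
    m: number of segments. Assumption: boundaries at floor(k * T / m).
    """
    T = len(frames)
    if m <= 0 or T < m:
        raise ValueError("need at least one frame per segment")
    dim = len(frames[0])
    encoded = []
    for k in range(m):
        start = k * T // m
        end = (k + 1) * T // m
        segment = frames[start:end]  # non-empty because T >= m
        # Average the segment dimension by dimension, then append, so
        # the output is the concatenation of the m segment averages.
        for d in range(dim):
            encoded.append(sum(frame[d] for frame in segment) / len(segment))
    return encoded
```

With 13-dim frames the output always has 13 × m entries regardless of T, which is what makes direct vector comparison between segments of different lengths possible.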
Observations
Except for DTW, which computes the distance between audio segments directly, all approaches map variable-length audio segments to fixed-length vectors for similarity evaluation.
DTW performs poorly, for two likely reasons:
Only the vanilla version was used.
The averaging process in NE may smooth out some of the local signal variations that cause disturbances in DTW.
Conclusions
The authors propose two unsupervised approaches that combine an RNN and an autoencoder to obtain fixed-length vector representations for audio segments.
The proposed approaches outperform the state-of-the-art method in real-world applications such as query-by-example spoken term detection (STD).