Computer Science Group
22601 王政祺 & 22625 劉至軒
「Deep Learning」+「Fake」
By building a DeepFake of our own, we can better understand the limitations of DeepFakes, and how to counter them.
Obviously, being able to put words into other people's mouths is fun (as long as the words are harmless)! So why not do a project that is both academically stimulating and fun?
"Hmm. In the end, cowards are those who follow the dark side."
"Oh my goodness"
AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
[Diagram: Input -> Encoder -> Encoding -> Decoder -> Output]
Encoder: learns an encoding (representation) of the input data
Decoder: learns to decode the encoding and produce some output
Example with a noisy '7': the encoding learns to represent that it's a '7' and to ignore the noise
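The encoder/decoder idea above can be sketched as a tiny denoising autoencoder. This is only an illustration under loud assumptions: instead of images of a '7', the toy data are 2-D points on the line y = 2x with added noise, the encoding is a single number, and the names `enc`, `dec`, `reconstruct` and `mean_loss` are hypothetical. It is not the network used in this project.

```python
import random

random.seed(0)

# Toy data (an assumption for illustration): clean points lie on the line y = 2x,
# and the network only ever sees a noisy copy of each point.
clean = [(a, 2 * a) for a in (random.uniform(-1, 1) for _ in range(50))]
noisy = [(x + random.gauss(0, 0.1), y + random.gauss(0, 0.1)) for x, y in clean]

enc = [0.1, 0.1]  # encoder weights: 2 inputs -> 1-number encoding
dec = [0.1, 0.1]  # decoder weights: 1-number encoding -> 2 outputs


def reconstruct(point):
    """Encode a point to one number, then decode it back to two numbers."""
    z = enc[0] * point[0] + enc[1] * point[1]
    return z, (dec[0] * z, dec[1] * z)


def mean_loss():
    """Mean squared error between decoded noisy inputs and the clean targets."""
    total = 0.0
    for inp, target in zip(noisy, clean):
        _, out = reconstruct(inp)
        total += (out[0] - target[0]) ** 2 + (out[1] - target[1]) ** 2
    return total / len(clean)


loss_before = mean_loss()
lr = 0.02  # learning rate
for _ in range(200):
    for inp, target in zip(noisy, clean):
        z, out = reconstruct(inp)
        err = [2 * (out[i] - target[i]) for i in range(2)]  # dLoss/dOutput
        grad_z = err[0] * dec[0] + err[1] * dec[1]          # gradient at the encoding
        for i in range(2):
            dec[i] -= lr * err[i] * z
            enc[i] -= lr * grad_z * inp[i]
loss_after = mean_loss()

print("loss before training:", loss_before)
print("loss after training: ", loss_after)
```

Because the encoding has room for only one number, the network is pushed to keep the position along the line (the "it's a 7" part) and discard the noise.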
[Diagram: the spectrogram \(X_1\) is fed to the Style Encoder \(E_S(\cdot)\), giving \(E_S(X_1)\), and to the Content Encoder \(E_C(\cdot)\), giving \(E_C(X_1)\)]
The model tries to separate the content (what is actually being said, i.e., phonology and tones) from the speaker data (accent, voice)
The content encoder receives both the style encoding and the original spectrogram as input, because it needs to learn to separate the two
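One common way to give the content encoder both inputs is to repeat the style encoding at every time frame and concatenate it with that frame of the spectrogram. The sketch below shows only this input-building step. The function name `content_encoder_input` and the toy sizes are hypothetical, and whether this project wires its inputs exactly this way is an assumption.

```python
def content_encoder_input(spectrogram, style):
    """Append the same style vector to every spectrogram frame."""
    return [frame + style for frame in spectrogram]


# Hypothetical toy sizes: 3 time frames, 4 frequency bins, 2 style dimensions.
spectrogram = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
style = [0.25, -0.5]

frames = content_encoder_input(spectrogram, style)
print(len(frames), len(frames[0]))  # prints: 3 6
```

Each frame now carries the speaker information alongside the audio, so the encoder can learn which parts of the frame are explained by the speaker and leave them out of the content code.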
[Diagram: \(X_1\) passes through the Style Encoder \(E_S(\cdot)\) and the Content Encoder \(E_C(\cdot)\), giving \(E_S(X_1)\) and \(E_C(X_1)\); a second style encoding \(E_S(X_2)\) is added]
We want the style of \(X_2\) and the content of \(X_1\)
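The swap described here can be illustrated with a deliberately simple toy model. The assumptions are loud: an "utterance" is just a list of numbers made of a zero-mean content pattern plus a constant per-speaker offset, and the three functions are hand-written stand-ins for \(E_S\), \(E_C\) and \(D\), not learned networks and not AutoVC itself.

```python
def style_encoder(x):
    """E_S stand-in: summarise the speaker as the mean level of the utterance."""
    return sum(x) / len(x)


def content_encoder(x, style):
    """E_C stand-in: given the utterance and its own style code, keep the rest."""
    return [v - style for v in x]


def decoder(content, style):
    """D stand-in: recombine a content code with a (possibly different) style code."""
    return [c + style for c in content]


def convert(x1, x2):
    """Return the content of x1 rendered in the style of x2."""
    content_1 = content_encoder(x1, style_encoder(x1))
    return decoder(content_1, style_encoder(x2))


x1 = [1.0, 3.0, 2.0]     # speaker 1: content [-1, 1, 0] plus offset 2
x2 = [10.0, 10.0, 13.0]  # speaker 2: content [-1, -1, 2] plus offset 11

converted = convert(x1, x2)
print(converted)        # prints: [10.0, 12.0, 11.0]
print(convert(x1, x1))  # prints: [1.0, 3.0, 2.0]
```

Converting an utterance to its own style gives back the original, which is the self-reconstruction an autoencoder is trained on. Conversion only replaces the style code at decoding time.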
[Diagram: \(E_S(X_2)\) and \(E_C(X_1)\) -> Decoder \(D(\cdot, \cdot)\) -> initial estimate \(\tilde{X}_{1\rightarrow 2}\) and residue \(R_{1 \rightarrow 2}\) -> \(\hat{X}_{1\rightarrow 2}\) -> WaveNet (Spectrogram to Wave) -> Finished Product!]
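Read as equations, the conversion path can be written roughly as follows. This is a sketch under two assumptions the slide does not state: that the decoder takes the content code as its first argument and the style code as its second, and that the final spectrogram is the initial estimate plus the residue.

\[
\tilde{X}_{1\rightarrow 2} = D\big(E_C(X_1),\, E_S(X_2)\big), \qquad
\hat{X}_{1\rightarrow 2} = \tilde{X}_{1\rightarrow 2} + R_{1\rightarrow 2}
\]

WaveNet then turns the spectrogram \(\hat{X}_{1\rightarrow 2}\) into an audio waveform.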
Generator: Content Encoder \(E_C(\cdot)\) + Decoder \(D(\cdot, \cdot)\) (auspicious3000 @ IBM)
WaveNet (Spectrogram to Wave) (r9y9 @ Google)
Style Encoder \(E_S(\cdot)\) (CorentinJ @ Resemble AI)
Generator
Generator
mostly complete!
Is in testing phase
Is in testing phase