[Deep.In. Article] A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild

Deep Learning Team : Dunkin

Abstract

Lip sync technology, which generates lip movements that match a given voice, is one of the most popular fields in deep learning. Take a movie as an example: what if a foreign actor's lips could be dubbed to match the language of our country? As if the actor had lived in Korea for a long time, the meaning of the speech would come through clearly and the viewer's immersion would be much better. It is also no longer surprising to see news footage in which politicians from other countries appear to speak Korean through deep learning technology. Natural and accurate lip sync technology is therefore expected to bring a big leap forward for the future service and communication industries.

 

How is lip sync technology implemented? It can be explained in two main steps. First, a neural network learns to match the main coordinates of the lip shape to the accompanying sound.

Then, it learns to synthesize realistic lips from a given set of mouth keypoints. The technology used at this step is the Generative Adversarial Network (GAN). A GAN is a type of neural network that produces outputs whose distribution resembles that of the dataset it was trained on.

Let’s take an example. If the Bank of Korea taught a neural network the shape and color distribution of its currency, the network would be able to create realistic counterfeit notes. In the same way, a neural network can learn to produce realistic human lip shapes if we teach it the approximate main keypoints.
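Continuing the counterfeit-note analogy, the adversarial objective behind a GAN can be sketched in a few lines. This is a minimal, framework-free illustration (the function names are for explanation only, not from the paper): the discriminator is trained to score real samples near 1 and fakes near 0, while the generator is trained so its fakes score near 1.

```python
import math

def bce(p, label):
    # binary cross-entropy for one predicted probability p against a 0/1 label
    eps = 1e-7
    p = min(max(p, eps), 1 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def discriminator_loss(d_real, d_fake):
    # the discriminator is rewarded for scoring real notes near 1
    # and counterfeits near 0
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    # the generator is rewarded when its counterfeit fools the discriminator
    return bce(d_fake, 1.0)
```

Training alternates between the two losses: the generator improves only as far as the discriminator can tell real from fake, which is exactly the dynamic the lip-sync networks below exploit.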

However, the network cannot easily learn this technique, because producing a realistic lip shape and synthesizing the human lower jaw are very complicated tasks. In particular, if you irresponsibly hand all of this complex homework to the network at once, it is easy to end up with unrealistically synthesized faces whose lips do not match the sound.

 

Main Contribution of Paper

  1. A lip-synchronization network called Wav2Lip was proposed that achieves state-of-the-art performance and works well even for input speech recorded in harsh, in-the-wild conditions.
  2. A benchmark and metrics were proposed to evaluate lip-sync performance.
  3. They collected and released a dataset called Real-world lip-Sync Evaluation (ReSyncED).
  4. In human evaluation of the synthesized videos, more than 90% of raters judged that Wav2Lip performed better than previous lip sync models.

 

Previous SOTA Baseline : LipGAN Model

The author cites LipGAN [1], the previous SOTA network, as a baseline. A brief summary follows.

 

  • Type of Data
  1. Voice data transformed by the MFCC (Mel-Frequency Cepstral Coefficient) technique.
  2. A face image of the target person to be synthesized (not synchronized with the voice data).
  3. A face image of the target person to be synthesized (the bottom half of the synced image is masked).

 

[Network Mechanism]

 

  1. The Audio Encoder (4 blocks), shown in red, processes the MFCC data.
  2. The Face Encoder (7 blocks), shown in blue, processes the synced face image (bottom half masked) and the un-synced whole face image.
  3. The audio embedding vector and the face embedding vector created by the two encoders (red and blue) are combined.
  4. The green Face Decoder (7 blocks) synthesizes the face from the combined embedding vector. Skip connections, as in U-Net, are kept so that face information is well preserved and delivered to the decoder. This decoding process acts as the generator in the GAN. (An L1 loss is assigned for reconstructing the target ground truth face image.)
  5. The synthesized image and the ground truth image (the face synchronized with the voice data) enter the yellow Face Encoder and are turned into embedding vectors through several operations.
  6. Similarly, the MFCC audio data used as input is made into an embedding vector through the gray audio encoder (4 blocks).
  7. A contrastive loss trains the voice embedding vector and the face embedding vector so that their match score becomes 0 if they are un-synced and 1 if they are synced.

 

Limitation

  1. An excessive amount of work is assigned to the generator. The structure makes it learn both to synthesize realistic faces that preserve the target person's identity and to determine, through the synthesized images, whether the lip movement is in sync. In other words, instead of studying only math and taking a math exam, the student must study math and English together and take exams in both subjects. Because existing networks such as LipGAN learn these complex tasks at once, it is difficult for them to synthesize appropriate mouth shapes.
  2. If you actually spend about 20 epochs on training, almost half of the epochs are biased toward facial synthesis, and lip synthesis only begins after that. Learning the shape of the lips therefore occupies only a small fraction of the entire training process. The author points out that the reconstruction loss around the mouth region is less than 4% of the total pixel reconstruction loss.
  3. LipGAN synthesizes only a single frame at a time. However, considering that the shape of the mouth is affected by the surrounding speech context, synthesizing from multiple frames, which lets the network learn this prior knowledge, is more appropriate for natural mouth movements.

 

Wav2Lip Model

To improve LipGAN’s issues, the author proposes a structure called Wav2Lip.

  • Type of Data
  1. Voice data transformed by the MFCC (Mel-Frequency Cepstral Coefficient) technique.
  2. A face image of the target person to be synthesized (not synchronized with the voice data).
  3. A face image of the target person to be synthesized (the bottom half of the synced image is masked).

 

  • Network Mechanism
  1. The Audio Encoder, shown in green, processes the MFCC data.
  2. The Face Encoder, shown in blue, processes the synced face image (bottom half masked) and the un-synced whole face image. Unlike LipGAN, several consecutive frames are used instead of a single frame.
  3. The audio embedding vector and the face embedding vector made by the two encoders are combined and passed through the decoder to reconstruct the target ground truth image set. An L1 loss is allocated for reconstruction.

 

  1. Generated images and ground truth images are evaluated by the Visual Quality Discriminator, which judges only whether the image is realistic, i.e. visual artifacts rather than voice sync. Unlike LipGAN, a binary cross-entropy loss is used instead of a contrastive loss. This helps remove visual artifacts independently of voice sync, so the discriminator can focus only on realistic facial synthesis. In the exam analogy, it raises a specialist student who solves only mathematics problems.
  2. Judging whether the voice synchronization is excellent should be left to an expert. A pre-trained Lip-Sync Discriminator, the Expert, is brought in to evaluate whether the sound and image are correctly synchronized. The main point is that the network must earn a reliable score from a well-trained expert; otherwise it cannot develop its synthesis skills. The paper therefore argues for bringing in a smart pre-trained network that professionally discriminates synchronization alone, which yields an accurate sync loss between the synthesized images and the voice data. More precisely, a cosine-similarity-based loss assigns a score of 1 if the sync is right and 0 if it is not.
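The Expert's cosine-similarity sync loss described above can be sketched as follows. This is an illustration that treats the embeddings as plain vectors; the function names are hypothetical, and the clamping constant is an assumption for numerical safety:

```python
import math

def cosine_similarity(v, s):
    # cosine of the angle between a video embedding v and an audio embedding s
    dot = sum(a * b for a, b in zip(v, s))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_s = math.sqrt(sum(b * b for b in s))
    return dot / (norm_v * norm_s)

def expert_sync_loss(video_emb, audio_emb, eps=1e-7):
    # the frozen expert scores sync probability via cosine similarity;
    # the generator is penalized (via -log) when that score is far from 1
    p_sync = max(cosine_similarity(video_emb, audio_emb), eps)
    return -math.log(p_sync)
```

Because the Expert is pre-trained and frozen, the generator cannot cheat it; the only way to lower this loss is to actually produce lips that match the audio.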

 

Evaluation Metric

  • Dataset
  1. LRW [4]
  2. LRS2 [5]
  3. LRS3 [6]
  • SyncNet : LSE-D, LSE-C

SyncNet is a network that emerged to determine whether a video is fake or not [2]. Given the mouth region of a video and the voice MFCC data, the network outputs a small distance between the audio and video embedding vectors if they are in sync, and a large distance if they are not.

 

Here, Lip-Sync Error Distance (LSE-D) is used as the evaluation metric to determine whether the video frames and the voice data are correctly synchronized.

 

 

If we apply a temporal offset between the video frames and the audio, we can compare the distance between the audio and video embedding vectors at each offset. At the moment the sync matches (where the temporal offset is 0), the distance is small, and it grows as the offset increases. Lip-Sync Error Confidence (LSE-C), a kind of reliability indicator, therefore emerged to measure how sharply the in-sync point stands out across the range of offsets: it is computed as the difference between the median and the minimum of the distance values.
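Under the description above, the two metrics can be sketched from a curve of embedding distances over temporal offsets. This is a simplified illustration, not the official SyncNet evaluation code: it takes the minimum of the curve as the in-sync distance and the median-minus-minimum as the confidence.

```python
def lse_metrics(distances_by_offset):
    # distances_by_offset: audio-video embedding distance at each temporal
    # offset, with the true in-sync point somewhere in the list
    d = sorted(distances_by_offset)
    n = len(d)
    median = d[n // 2] if n % 2 == 1 else 0.5 * (d[n // 2 - 1] + d[n // 2])
    lse_d = min(distances_by_offset)   # distance at the best-matching offset
    lse_c = median - lse_d             # confidence: how sharply the minimum stands out
    return lse_d, lse_c
```

A well-synced video produces a distance curve with a deep, narrow dip at offset 0, giving a small LSE-D and a large LSE-C; a flat curve (no clear sync point) gives a near-zero LSE-C.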

 

  • FID (Fréchet Inception Distance)

FID measures how realistic the generated images are: Inception-network features are extracted from real and generated images, each feature set is modeled as a Gaussian, and the Fréchet distance between the two Gaussians is computed. The lower the FID, the closer the generated images are to the real ones.
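The Fréchet distance between two Gaussians has a closed form; for the one-dimensional case it reduces to the formula below. Real FID applies the multivariate version (with covariance matrices and a matrix square root) to Inception features, so this is only an illustrative sketch of the underlying distance:

```python
import math

def frechet_distance_1d(mu1, var1, mu2, var2):
    # Fréchet distance between two 1-D Gaussians N(mu1, var1) and N(mu2, var2):
    # (mu1 - mu2)^2 + var1 + var2 - 2*sqrt(var1*var2)
    return (mu1 - mu2) ** 2 + var1 + var2 - 2.0 * math.sqrt(var1 * var2)
```

Identical distributions give a distance of 0, and the distance grows with any gap in mean or variance, which is why a low FID indicates generated images statistically close to real ones.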

 

Results

1. Temporal Window: One of the big differences from the baseline LipGAN is that Wav2Lip uses multiple frames as its input. In fact, when training with an increasing number of frames, both LSE-D and LSE-C improved as the temporal window grew.

 

2. Pre-trained Discriminator: Using the pre-trained Expert network, which professionally checks only lip synchronization, the LSE-D and LSE-C scores surpassed the existing Speech2Vid [3] and LipGAN models. Refer to Wav2Lip (ours).

 

 

3. Visual Quality Discriminator: Unlike LipGAN, adding a discriminator that judges only the visual realism of the images caused a slight decrease in LSE-D and LSE-C, but in terms of FID the visual quality is much better, so much more realistic lip movements can be expressed. It also received much higher preference and user-experience scores. Refer to Wav2Lip + GAN (ours).

 

 

Conclusion and Opinion

Wav2Lip is a network that can synthesize much more accurate lip sync videos than previous models. It was impressive that the authors did not stop at using a discriminator to remove visual artifacts, but further boosted performance with an external, pre-trained discriminator for much better synchronization. In addition, various metrics and datasets were provided for performance evaluation, and higher objectivity and reliability were demonstrated through user-preference scores. In the near future, motion cues such as gestures and head pose will be added, and much of that research is already under way. The lip sync synthesis model through deep learning is expected to develop further and reach people as a richer service.

 

Reference

[1] Towards Automatic Face-to-Face Translation

[2] Out of time: automated lip sync in the wild

[3] You said that? Synthesising talking faces from audio

[4] Lip reading in the wild

[5] Deep Audio-Visual Speech Recognition

[6] LRS3-TED: a large-scale dataset for visual speech recognition

[7] U-Net: Convolutional Networks for Biomedical Image