[Deep.人. Article] DeepBrain AI’s Deep-Learning-based Video and Voice Synthesis Technology

AI Human is a technology that takes only text as input and, using deep learning models trained on a real person’s face and voice, produces video in which not only the voice (speech and intonation) but also the face, facial expressions, and movements are expressed naturally.

Today, we will explain the learning models behind deep learning-based image synthesis and introduce DeepBrain AI’s technology for implementing AI Human.



1) Main Learning Models

[CNN-Image Classification Algorithm]
A Convolutional Neural Network (CNN) analyzes an image by sliding filters with shared weights across it. A feature is the data that a filter extracts from the input, such as an edge or a texture.


<CNN Architecture>


The function of a CNN is to classify and recognize images.
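
The shared-weight filter described above can be sketched in a few lines of plain Python. This is only an illustration of a single convolution step, not DeepBrain AI’s model: the function name `conv2d`, the 4x4 image, and the 2x2 edge filter are all assumptions chosen to keep the example small.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every position (i, j):
            # this reuse is what "shared weights" means.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical 2x2 filter that responds where brightness rises left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The feature map is non-zero only in the column where the dark-to-bright edge sits, which is the sense in which a filter "extracts a feature" from the input. A real CNN stacks many such filters and learns their weights during training.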


A Generative Adversarial Network (GAN) is an adversarial deep learning model in which two neural networks compete: one creates “plausible fakes” that look real at first glance, and training repeats until those fakes can no longer be distinguished from the real thing.
The generator first produces an image from random noise. The discriminator then looks at real images and fake images and judges each as real or fake, and that judgment is used to train the generator.
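
As a rough sketch of this adversarial setup, the two competing objectives can be written as binary cross-entropy losses. This is the commonly used GAN formulation, not necessarily the one DeepBrain AI uses; the function names and the discriminator outputs below are made-up values for illustration.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for a single probability strictly between 0 and 1."""
    return -(target * math.log(prediction)
             + (1 - target) * math.log(1 - prediction))


def discriminator_loss(d_real, d_fake):
    """The discriminator should output 1 for real images and 0 for fakes."""
    real_term = sum(bce(p, 1) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """The generator is rewarded when the discriminator labels its fakes as real."""
    return sum(bce(p, 1) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs: probability that each image is real.
early_d_real, early_d_fake = [0.9, 0.8], [0.1, 0.2]  # fakes are easy to spot
late_d_real, late_d_fake = [0.5, 0.5], [0.5, 0.5]    # real and fake look alike

print("early:", discriminator_loss(early_d_real, early_d_fake),
      generator_loss(early_d_fake))
print("late: ", discriminator_loss(late_d_real, late_d_fake),
      generator_loss(late_d_fake))
```

Early in training the generator loss is high because its fakes are easily rejected. When the discriminator can only guess (0.5 for every image), the generator loss falls to ln 2 and the discriminator loss rises to 2 ln 2, which corresponds to the point where fakes are indistinguishable from real images.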




2) DeepBrain AI’s Original Technology



<Lip Sync, Face Synthesis Technology>

Lip Sync is a technology that takes a video of a particular person speaking together with an arbitrary voice, and synthesizes the original footage so that the shape of the mouth matches the given voice. In doing so, it controls the person’s speech behavior (mouth shape, jaw movement, neck movement) from the voice. In other words, given an arbitrary voice and a background image as input, you can synthesize an image of the person speaking.
To produce varied behavior patterns that follow the speech, feature vectors are first extracted from video of the person speaking to capture the distribution of behavior patterns. The model then learns to generate those behavior patterns from feature vectors extracted from the speech.
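
The idea of learning a mapping from speech features to behavior features can be illustrated with a deliberately tiny stand-in: a least-squares fit from one speech feature to one mouth-shape parameter. Everything here is an assumption for illustration (the single "energy" feature, the "mouth opening" value, and the numbers); the actual system learns high-dimensional feature vectors with deep networks.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical per-frame training pairs taken from video of the person speaking:
# a speech feature (loudness) and a behavior feature (how open the mouth is).
speech_energy = [0.0, 0.2, 0.4, 0.8]
mouth_opening = [0.1, 0.2, 0.3, 0.5]

slope, intercept = fit_linear(speech_energy, mouth_opening)


def predict_mouth_opening(energy):
    """Predict the mouth-shape parameter for one frame of new speech."""
    return slope * energy + intercept


# An arbitrary new voice, frame by frame, drives the mouth shape.
new_voice = [0.1, 0.6]
predicted = [predict_mouth_opening(e) for e in new_voice]
print(predicted)
```

The two phases mirror the description above: features from the speaking video define what behavior looks like, and the fitted mapping then produces behavior from speech features alone.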


<Real-time Video Synthesis Technology>

DeepBrain AI was the first company in the world to succeed in synthesizing video in real time, through the development of process optimization technology. Three major technologies are needed to implement video synthesis that can communicate with customers in real time.
The first is batch processing technology. To optimize the speed of video synthesis, we developed and applied our own batch processing technology. Processing multiple synthesis requests simultaneously reduces the latency of video synthesis.
The second is cache server optimization technology. Since most conversations can be turned into data and retained, questions and conversations that are expected to be used repeatedly are stored on the cache server so that video can be transmitted quickly in real time.
The last is Idle Framing technology. The AI model’s expression is natural while it is speaking, but if the model stays still while the user is speaking, the interaction can feel very unnatural to the user. To overcome this, the model makes natural movements while the user speaks, giving the user the feeling of being listened to and minimizing that gap.
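
A minimal sketch of how batch processing and a cache of expected questions might fit together is shown below. It is not DeepBrain AI’s implementation: the class `VideoServer`, the helper `normalize`, and the stub `fake_synthesize_batch` are hypothetical names, and the "video" is just a string standing in for a rendered clip.

```python
def normalize(text):
    """Cache key: lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


class VideoServer:
    def __init__(self, synthesize_batch, expected_questions):
        self.synthesize_batch = synthesize_batch
        # Pre-render clips for questions expected to be asked repeatedly.
        keys = [normalize(q) for q in expected_questions]
        self.cache = dict(zip(keys, synthesize_batch(keys)))

    def serve(self, texts):
        keys = [normalize(t) for t in texts]
        # Collect cache misses once each, preserving their order.
        misses = list(dict.fromkeys(k for k in keys if k not in self.cache))
        # Synthesize all misses in a single batched call.
        fresh = dict(zip(misses, self.synthesize_batch(misses))) if misses else {}
        return [self.cache[k] if k in self.cache else fresh[k] for k in keys]


# Stub synthesizer (assumption): records each batch it receives.
calls = []


def fake_synthesize_batch(texts):
    calls.append(list(texts))
    return [f"<video:{t}>" for t in texts]


server = VideoServer(fake_synthesize_batch, ["What are your opening hours?"])
clips = server.serve([
    "what are your  opening hours?",  # served from the cache
    "Where is the ATM?",              # miss
    "Where is the ATM?",              # same miss, synthesized only once
])
print(clips)
print(len(calls), "synthesis calls in total")
```

In this sketch the expected question never reaches the synthesizer at request time, and the two identical uncached requests share one batched synthesis call. A production system would also need cache expiry and a limit on batch size, which are left out here.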