How AI version of Arirang TV anchor was created

DeepBrain AI recently created an AI version of a news anchor with Arirang TV, Korea's largest global broadcasting network.

 

Jennifer Moon, the AI version of Arirang TV's main news anchor Moon Connyoung, is now used for many news reports. Arirang TV recently unveiled the making of the AI anchor and her appearance at the CES 2022 event site in a feature news video.

 

AI anchor Jennifer Moon was unveiled for the very first time at CES 2022. AI was one of the hottest topics at this year's CES, and eye-catching innovations delighted tech fans both young and old.

 

AI anchor Jennifer Moon was created in DeepBrain AI’s AI model production studio. Anchor Moon Connyoung, the actual model of the AI anchor, was filmed while reading scripts written in both Korean and English.

 

Based on this video and image data, AI anchor Jennifer Moon, who is fluent in both Korean and English, was produced with deep learning algorithms developed by DeepBrain AI.

 

We will keep you updated so please stay tuned for more stories of the AI anchor Jennifer Moon!

 


[Deep.In.Interview] Software Engineer : Peter

https://www.youtube.com/watch?v=90N6YF-R2Hs

Let me introduce you to the honest stories of Deep.In. at DeepBrain AI.
*DeepBrain AI cherishes the colleagues who work together. Deep.In. is the nickname for DeepBrain AI employees, chosen to appreciate the strong fellowship between them.

English name : Deep. In.
Korean name : 딥.인.
International name : Deep.人.

 

Q. Please introduce yourself.
Hi, my name is Peter and I am a full-stack software engineer here at DeepBrain AI.

Q. Tell us about Deepbrain AI's Dev. Team.
Our dev team is currently working on creating products that utilize DeepBrain AI's unique AI human technology. Everyone is very passionate and supportive, which makes working at DeepBrain very enjoyable.

 

Q. What are you working on these days?
I am focusing on building a Software-as-a-Service product that integrates all of our services into a single platform. Eventually, we are hoping to give users easier access to our products, like AI Studios, AI Human SDK, and AI Kiosk, through this integrated SaaS platform.

Q. Tell us about AI Studios.
AI Studios is a video generation platform that combines all of DeepBrain AI's cutting-edge technologies. By just entering a script, you can produce a variety of videos featuring AI humans. This greatly reduces the steps needed to create production-quality video content. Users have been using AI Studios to create everything from YouTube videos to real-time news content.

 

Q. What's your favorite part of working at DeepBrain AI?
Everyone is very supportive, so when I am stuck on something, there's always someone I can reach out to for help. Also, there are a lot of opportunities to try out new technologies here at DeepBrain. There's always something new to learn every day.

Q. What's your favorite company welfare benefit?
I particularly like Family Day, which promotes work-life balance and time with your family.

 

Q. Why did you decide to join DeepBrain AI?
I thought there would be so many applications for DeepBrain's AI human technology. I wanted to contribute to developing unique applications like AI Studios using this cutting-edge AI technology.

Q. Is there anything you would like to say to your team members?
Thank you everyone for being so supportive, I always learn so much by working with you guys.

 


[Deep.In.Interview] Deep Learning Team : Noah!

 

https://youtu.be/FkUj1nj3P2g

Let me introduce you to the honest stories of Deep.In. at DeepBrain AI.
*DeepBrain AI cherishes the colleagues who work together. Deep.In. is the nickname for DeepBrain AI employees, chosen to appreciate the strong fellowship between them.

English name : Deep. In.
Korean name : 딥.인.
International name : Deep.人.

 

Q. Please introduce yourself.
Hello, I'm Nam Kyu-hyun, a researcher on DeepBrain AI's Deep-Learning Team.

Q. What are your roles in the Deep-Learning Team?
Our team researches and develops the technologies that go into creating AI Humans. Among them, I'm mainly in charge of the voice synthesis system, and I also conduct NLP research for chatbots.

 

Q. Tell us about what it’s like to work in Deep-Learning Team.
Our team promotes collaboration and exchange of opinions. All of our team members, including our team leader, discuss problems together and conduct seminars when needed. And we are the only team in the company with a music concert during our lunch break.

Q. What do you think Deep-Learning is?
AI itself is a field that imitates and implements human intelligence, and Deep-Learning does not deviate much from this. The models we use are artificial neural networks that abstract the human brain, and the field is full of techniques that mimic human problem-solving and learning methods. In a way, I think Deep-Learning is a way to understand people a little better.

 

Q. What does it take to work in the Deep-Learning Team?
This is probably the same for any team, but I think continuous self-development is important. For model development and improvement, processes such as research, seminars, implementation, and evaluation must be repeated continuously. Every process on its own needs new technologies, so getting ready for these changes and constantly learning about them would be a priority.

Q. What’s your favorite thing about DeepBrain AI?
I think it's the constant improvement of the work environment. For example, the decorated rooftop and massage chairs, improvements to the welfare system, and Christmas trees decorated for the holidays. I enjoy and appreciate these continuous improvements to our work environment.

 

Q. Why did you choose DeepBrain AI?
Something you can't exclude when you picture the future of technology is the AI Human. I had always been fascinated by AI Human technology, and as I was graduating from university, I heard about DeepBrain AI and its technology. Since I was looking for work in this area, I chose DeepBrain AI.

Q. Any message towards Deep-Learning Team members?
Someone once told me that Deep-Learning Team can be described as “Happy”. I believe that’s the most suitable word for our team. I’m always thankful for our teammates, and I hope we can be at the forefront of this technology.

 


DeepBrain AI implements AI human technology at KB Kookmin Bank, deploying Korea's first kiosk-type 'AI banker'

▶ Contactless counseling service tailored to the COVID-19 situation and significant reduction in waiting time
▶ Provides information on financial products, branch information, weather and instructions on how to use banking devices within the branch
▶ Maximizing user experience with natural gestures such as hand movements and head nods

DeepBrain AI, a company specializing in artificial intelligence (AI), announced on the 28th that it has signed a technology supply agreement with KB Kookmin Bank, a leading financial company, implemented Korea's first kiosk-type 'AI banker', and officially introduced it this month.

Since March of last year, DeepBrain AI has been working closely with KB Kookmin Bank to improve the AI banker's functions and performance by piloting it in the AI experience zone at the bank's Yeouido headquarters. As a result, it succeeded in commercializing an AI human-based kiosk product for the first time in Korea, drawing great attention from the IT industry as well as the financial sector.

 

DeepBrain AI's AI human technology is a solution that creates a virtual human capable of real-time interactive communication. It implements an AI that can communicate directly with users by fusing speech synthesis, video synthesis, natural language processing, and speech recognition technologies. Because the technology enables fully contactless service in many fields, banks can use it to offer a safe counseling service to customers who prefer non-face-to-face interaction during the COVID-19 situation and to shorten customer waiting times through faster responses.

 

First, the AI banker greets customers when they arrive at the kiosk and answers their questions. All answers go through a process of deriving the optimal information based on KB-STA, a financial language model developed by KB Kookmin Bank, and are delivered to customers through the AI banker's video and voice, implemented with DeepBrain AI's AI human technology.

Specifically, the AI banker can explain how to use nearby devices such as the STM (smart automated machine) and ATM (automated teller machine) and the pre-filled form service, introduce financial products, and provide information about the branch where the kiosk is installed. It is also loaded with everyday information such as basic financial knowledge, today's weather, and nearby facilities.

 

In addition, the AI banker has an idle mode and can make natural gestures such as moving its hands, nodding, and straightening its clothes during conversation, maximizing the user experience from the customer's point of view. It can also recognize people through the front camera, so if a customer leaves, the session automatically ends with a word of thanks.

The AI banker's clothing uses KB Kookmin Bank's main colors, yellow and gray, so that customers recognize the brand image while using the kiosk.

 

Read More >>


[Deep.In. Article] AdaSpeech: Adaptive Text to Speech for Custom Voice

Deep Learning Team : Colin

Abstract

You may have changed the guidance voice while using an AI speaker or a navigation system. I set mine to the voice of my favorite actor, Yoo In-na. As speech synthesis technology has been incorporated into many parts of life, such as personal assistants, news broadcasts, and voice directions, it has become important to synthesize speech with a variety of voices. There is also a growing demand to use not only other people's voices but one's own voice as an AI voice, which is called custom voice in speech synthesis research.

Today, we will look at AdaSpeech, a text-to-speech (TTS) model designed for custom voice synthesis. Custom voices are mainly generated by adapting a pre-trained source TTS model to the user's voice. For the user's convenience, the amount of adaptation speech data is usually small, which makes it very difficult to generate voices that sound natural and similar to the original speaker. There are two main problems in training neural networks for custom voice.

First, a particular user's voice often has acoustic conditions different from the speech data on which the source TTS model was trained. For example, speakers vary in prosody, style, emotion, accent strength, and recording environment, and the resulting differences in speech data can hinder the generalization of the source model and lead to poor adaptation quality.

Second, when adapting the source TTS model to a new voice, there is a trade-off between the number of fine-tuned parameters and voice quality. The more adaptation parameters you fine-tune, the better the quality, but also the higher the memory usage and the cost of deploying the model.

Existing studies have approached this by fine-tuning the entire model or part of it (especially the decoder), fine-tuning only the speaker embedding used to distinguish speakers in multi-speaker speech synthesis, or training a speaker encoder module, often assuming that the source speech and the adaptation data come from the same domain. However, these approaches are hard to use in practice because they either require too many parameters or do not produce satisfactory quality.

AdaSpeech is a TTS model that can efficiently generate new users' (or speakers') voices with high quality while solving the above problems. The pipeline is largely divided into three stages, pre-training, fine-tuning, and inference, and two techniques are used to address the difficulties described above. From now on, we will look at them together! :)

 

Summary for Busy People

  • The generalization performance of the model was improved by extracting acoustic features at several granularities from the speech data and adding them to the phoneme encoding vectors through acoustic condition modeling.
  • They have efficiently improved the process of adapting the source model to the data of the new speaker using conditional layer normalization.
  • It has become possible to create high-quality custom voices with fewer parameters and less new speech data than traditional baseline models.

 

Model Structure

AdaSpeech's backbone is FastSpeech 2. It consists largely of a phoneme encoder, a variance adaptor, and a mel decoder, and it includes two new elements (the pink areas in Figure 1) devised by the authors.

 

Acoustic Condition Modeling

In general, it is important to increase the generalization performance of the model, because the source speech used in training cannot cover all the acoustic conditions of a new user's voice. Since the text input to a TTS model can hardly convey these acoustic conditions, the model tends to memorize the acoustic conditions of the training data, which hurts generalization when generating custom voices. The simplest way to solve this problem is to provide the acoustic conditions as additional input to the model. This is called acoustic condition modeling, and the conditions are divided into speaker level, utterance level, and phoneme level, covering acoustic features from the broadest scope down to the most local. Each level contains the following information.

  • Speaker level: captures the overall characteristics of a speaker, covering the broadest range of acoustic conditions (e.g., the speaker embedding).
  • Utterance level: captures features of how a whole sentence is pronounced. A mel spectrogram of a reference speech is used as input, and a feature vector is extracted from it. During training, the target speech serves as the reference; at inference, one utterance of the target speaker is randomly selected and used as the reference.
  • Phoneme level: the most local level, capturing features of individual phonemes in a sentence (e.g., the stress, pitch, and prosody of a particular phoneme, or transient background noise). Here the input is a phoneme-level mel spectrogram, obtained by replacing the mel frames belonging to the same phoneme with their average within that span (see the sketch after this list). At inference, the structure is the same, but an acoustic predictor takes the hidden vector from the phoneme encoder as input and predicts the phoneme-level vector.
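
To make the phoneme-level input concrete, here is a minimal sketch, in PyTorch, of how mel frames belonging to the same phoneme could be averaged into a phoneme-level mel spectrogram. It assumes a precomputed phoneme-duration alignment and is only an illustration, not the authors' implementation.

```python
# Minimal sketch of the phoneme-level mel spectrogram: frames of the same phoneme
# are replaced by their average. `durations` is an assumed per-phoneme frame count.
import torch

def phoneme_level_mel(mel: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """mel: (n_frames, n_mels); durations: (n_phonemes,) frame counts summing to n_frames.
    Returns (n_phonemes, n_mels), one averaged mel vector per phoneme."""
    chunks = torch.split(mel, durations.tolist(), dim=0)   # one chunk per phoneme
    return torch.stack([chunk.mean(dim=0) for chunk in chunks])
```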

 

Conditional Layer Normalization

 

AdaSpeech's mel decoder consists of self-attention and feed-forward networks based on the Transformer, and because they contain many parameters, fine-tuning them all to a new voice would be inefficient. The authors therefore applied conditional layer normalization to the self-attention and feed-forward networks of each layer, reducing the parameters updated during fine-tuning to only the scale and bias adapted to each user. These scale and bias vectors are called conditional because, as shown in the figure above, they are produced by small linear layers from the speaker embedding.
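
As an illustration, here is a minimal sketch of conditional layer normalization in PyTorch. The module name and dimensions are assumptions for the example, not the paper's exact implementation.

```python
# Minimal sketch of conditional layer normalization: the scale and bias are not
# learned directly but predicted from the speaker embedding by small linear layers.
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, speaker_dim: int):
        super().__init__()
        self.to_scale = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        normalized = (x - mean) / (std + 1e-5)
        scale = self.to_scale(speaker_emb).unsqueeze(1)   # (batch, 1, hidden_dim)
        bias = self.to_bias(speaker_emb).unsqueeze(1)
        return scale * normalized + bias
```

During fine-tuning, only `to_scale`, `to_bias`, and the speaker embedding need to be updated, which is exactly the efficiency gain described above.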

 

Training and Inference Process

The process of training AdaSpeech and synthesizing voices for new speakers can be summarized by the algorithm above. First, the source model is pre-trained with as much text-speech data as possible; then, during fine-tuning, only the conditional layer normalization parameters and the speaker embedding are updated with the new speaker's speech data. At inference, the parameters computed from the speaker information and the parameters that were not fine-tuned are used together to generate the mel spectrogram.
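
To make the fine-tuning stage concrete, the sketch below shows how only the adaptive parameters might be selected and optimized. It assumes the hypothetical `ConditionalLayerNorm` module from the previous sketch and a `speaker_embedding` attribute on the model; the learning rate is an arbitrary example value.

```python
# Minimal sketch of the adaptation step: freeze the pre-trained source model and
# update only the conditional layer norm projections and the speaker embedding.
import torch

def adaptive_parameters(model: torch.nn.Module):
    params = list(model.speaker_embedding.parameters())    # assumed attribute name
    for module in model.modules():
        if isinstance(module, ConditionalLayerNorm):        # from the sketch above
            params += list(module.parameters())
    return params

def prepare_for_finetuning(model: torch.nn.Module, lr: float = 2e-4):
    for p in model.parameters():
        p.requires_grad = False          # freeze everything learned in pre-training
    finetune_params = adaptive_parameters(model)
    for p in finetune_params:
        p.requires_grad = True           # re-enable only the adaptive parameters
    return torch.optim.Adam(finetune_params, lr=lr)
```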

 

Experiment Results

Custom Voice Quality Evaluation

 

MelGAN was used as the vocoder; the naturalness of the synthesized custom voices was evaluated with MOS, and similarity with a metric called SMOS. AdaSpeech synthesizes high-quality voices with fewer or a similar number of adaptation parameters compared to the baselines. And since the source TTS model was pre-trained on the LibriTTS dataset, it unsurprisingly receives the highest scores when adapted to new speakers from LibriTTS.

 

Ablation Study

Using CMOS (comparison MOS), which evaluates relative quality, the authors conducted an ablation study on the techniques claimed as contributions in this paper. Since each ablated version of AdaSpeech scored a lower CMOS than the full AdaSpeech in Table 2, we can conclude that every technique contributes to the quality improvement.

 

Acoustic Condition Modeling Analysis

Figure 4(a) shows the utterance-level acoustic vectors of the training speakers visualized with t-SNE. Different sentences spoken by the same speaker fall into the same cluster, which suggests that the model has learned each speaker's unique characteristics when speaking a sentence. There are some exceptions, but these are usually short or emotional utterances, which are difficult to distinguish from other speakers' utterances.

 

Conditional Layer Normalization Analysis

Comparing CMOS scores, voice quality is best when conditional layer normalization is used. In other words, when performing layer normalization it is better to modify the scale and bias to reflect the speaker's characteristics, and updating only these parameters has a positive effect on the model's adaptability.

 

Amount of Adaptive Data Analysis

Finally, the authors tested how much of a new user's speech data is needed to determine whether the model is practical. As Figure 4(b) shows, the quality of the synthesized voice improves rapidly up to 10 adaptation samples, but beyond that there is no significant improvement, so fine-tuning AdaSpeech with only about 10 samples per speaker is sufficient.

 

Conclusion and Opinion

AdaSpeech is a TTS model that can adapt to new users while retaining the advantages of FastSpeech, which improved synthesis speed through parallel generation. Acoustic condition modeling improves the generalization of the model by capturing the characteristics of the voice, and if it is subdivided further, an AI that speaks even more like the user might be created. A model that can deliver custom voice TTS with only 10 samples has enormous practical value, but it is still unfortunate that both the user's voice and the corresponding text must be provided for fine-tuning. In practice, even users who are willing to record their voice for an AI voice synthesis service may find it burdensome to type the matching text as well. So, in the next post, we will introduce a modified version of AdaSpeech that enables custom voice synthesis without text-speech paired data.

 


Reference

(1) [FastSpeech 2 Paper] FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

(2) [AdaSpeech Paper] AdaSpeech: Adaptive Text to Speech for Custom Voice

(3) [AdaSpeech Demo] https://speechresearch.github.io/adaspeech/


[Deep.In. Article] A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild

Deep Learning Team : Dunkin

Abstract

Lip sync technology, which generates the right lip movements for given voice data, is one of the most popular fields in deep learning. Take a movie as an example. What if a foreign actor's lips moved as if dubbing in our language? As with an actor who has lived in Korea for a long time, the meaning of the speech would come across well and the immersion would be much better. It is also no longer surprising to see the news show politicians from other countries speaking in Korean through deep learning technology. Natural and accurate lip sync technology is therefore expected to bring a big leap forward for future service and communication industries.

 

How is lip sync technology implemented? It can be explained in two main steps. First, a neural network learns to match the main coordinates of the lip shape to the sound.

Then, it learns to synthesize realistic lips from the given set of mouth keypoints. The technology used at this step is the Generative Adversarial Network (GAN), a type of neural network that produces outputs whose distribution is similar to that of the dataset it was trained on.

Let's take an example. If the Bank of Korea taught a neural network the shape and color distribution of its currency, the network would be able to create realistic counterfeit notes. Likewise, a neural network learns to make realistic human lip shapes if we teach it the approximate main keypoints.

However, the network cannot easily learn this, because producing realistic lip shapes and synthesizing the human lower jaw are very complicated tasks. In particular, if you simply dump all of this complex homework onto a single network and hope it learns well, you will often observe that the sound and lips do not match and the synthesized faces look unrealistic.

 

Main Contribution of Paper

  1. A lip-synchronization network, Wav2Lip, that works well for input speech even in harsh conditions was proposed, achieving state-of-the-art performance.
  2. A benchmark and metrics were proposed to evaluate lip-sync performance.
  3. They collected and released a dataset called Real-world lip-Sync Evaluation (ReSyncED).
  4. In human evaluations of the synthesized videos, more than 90% of raters judged that Wav2Lip performed better than previous lip sync models.

 

Previous SOTA Baseline : LipGAN Model

The authors cite LipGAN [1], the previous SOTA network, as a baseline. A brief summary follows.

 

  • Type of Data
  1. Voice data transformed by the MFCC (Mel-Frequency Cepstral Coefficient) technique.
  2. An image of the face of the target person to be synthesized (not synced with the voice data).
  3. An image of the face of the target person to be synthesized (a synced image with the bottom half covered).

 

[Network Mechanism]

 

  1. The Audio Encoder (4 blocks), shown in red, encodes the MFCC data.
  2. The Face Encoder (7 blocks), shown in blue, encodes the synced face image (bottom half covered) and the un-synced whole face image.
  3. The audio embedding vector and the face embedding vector produced by the two encoders (red and blue) are combined.
  4. The green Face Decoder (7 blocks) synthesizes the face from the combined embedding vector. U-Net-like skip connections [7] preserve face information and deliver it to the decoder. This decoding process acts as the generator of the GAN (an L1 loss is assigned for reconstructing the target ground-truth face image).
  5. The synthesized image and the ground-truth image (the face synchronized with the voice data) enter the yellow Face Encoder and are turned into embedding vectors through several operations.
  6. Similarly, the input audio MFCC data is made into an embedding vector through a gray Audio Encoder (4 blocks).
  7. A contrastive loss trains the audio embedding and face embedding so that their match score is 1 when they are synced and 0 when they are un-synced (see the sketch after this list).
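
The contrastive loss in step 7 could look roughly like the generic sketch below; the margin value and function names are assumptions for illustration, not LipGAN's exact code.

```python
# Generic contrastive-loss sketch for the audio-face sync check in step 7.
import torch
import torch.nn.functional as F

def contrastive_sync_loss(audio_emb: torch.Tensor,
                          face_emb: torch.Tensor,
                          is_synced: torch.Tensor,
                          margin: float = 1.0) -> torch.Tensor:
    """audio_emb, face_emb: (batch, dim); is_synced: (batch,) with 1 = synced, 0 = un-synced."""
    dist = F.pairwise_distance(audio_emb, face_emb)        # small distance = in sync
    pos = is_synced * dist.pow(2)                          # pull synced pairs together
    neg = (1 - is_synced) * F.relu(margin - dist).pow(2)   # push un-synced pairs apart
    return (pos + neg).mean()
```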

 

Limitation

  1. An excessive number of tasks is assigned to the generator. The structure makes it learn both to synthesize realistic faces that preserve the target person's identity and to judge, from the synthesized images, whether the lip movement is in sync. In other words, instead of just studying math and taking a math exam, it has to study math and English together and take exams in both subjects. Because existing networks such as LipGAN learn these complex tasks at once, it is difficult for them to synthesize appropriate mouth shapes.
  2. If training runs for about 20 epochs, almost half of the epochs are spent mostly on facial synthesis, and lip synthesis only starts to improve after that. Learning the shape of the lips therefore occupies only a small part of the whole training process. The authors point out that the reconstruction loss around the mouth accounts for less than 4% of the total pixel reconstruction loss.
  3. LipGAN synthesizes only a single frame at a time. However, considering that the shape of the mouth is affected by the surrounding audio context, synthesizing from multiple frames, which lets the model use this prior knowledge, is more appropriate for natural mouth movements.

 

Wav2Lip Model

To address LipGAN's issues, the authors propose a structure called Wav2Lip.

  • Type of Data
  1. Voice data transformed by the MFCC (Mel-Frequency Cepstral Coefficient) technique.
  2. An image of the face of the target person to be synthesized (not synced with the voice data).
  3. An image of the face of the target person to be synthesized (a synced image with the bottom half covered).

 

  • Network Mechanism
  1. The Audio Encoder, shown in green, encodes the MFCC data.
  2. The Face Encoder, shown in blue, encodes the synced face images (bottom half covered) and the un-synced whole face images. Unlike LipGAN, several consecutive frames are used instead of a single frame.
  3. The audio embedding vector and face embedding vector produced by the two encoders are combined and passed through the decoder to reconstruct the target ground-truth image set. An L1 loss is assigned for this reconstruction.

 

  1. Generated images and ground-truth images are evaluated by a Visual Quality Discriminator, which judges only whether an image looks realistic, independent of voice sync. Unlike LipGAN, a binary cross-entropy loss is used here rather than a contrastive loss. This discriminator helps remove visual artifacts regardless of voice sync, so the generator can focus on realistic facial synthesis. It fosters, so to speak, a monster student who only has to solve math problems.
  2. Judging whether the synchronization is good should be left to an expert. A pre-trained Lip-Sync Discriminator, the "expert", is brought in to evaluate whether the sound and image are in sync. The main point is that the generator needs a reliable score from a well-trained expert, otherwise it cannot develop its synthesizing skill. The paper therefore argues for using a smart pre-trained network that professionally discriminates only synchronization, which yields an accurate sync loss between the synthesized images and the voice data. More precisely, a cosine-similarity based loss is used, scoring 1 if the pair is in sync and 0 if it is not (see the sketch after this list).
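
The expert sync loss described in point 2 could be sketched as follows, assuming the frozen pre-trained expert returns an audio embedding and a video embedding for a given pair (the function and argument names are hypothetical).

```python
# Minimal sketch of the expert sync loss: cosine similarity between the expert's
# audio and video embeddings is treated as a sync probability and scored with BCE.
import torch
import torch.nn.functional as F

def expert_sync_loss(audio_emb: torch.Tensor,
                     video_emb: torch.Tensor,
                     target: torch.Tensor) -> torch.Tensor:
    """target: 1.0 for in-sync pairs, 0.0 for off-sync pairs."""
    sim = F.cosine_similarity(audio_emb, video_emb, dim=-1)
    prob = sim.clamp(min=1e-7, max=1.0)   # keep the value in (0, 1] so the log is defined
    return F.binary_cross_entropy(prob, target)
```

When training the generator, the expert stays frozen and the target is 1 for the synthesized frames, so the generator is pushed toward outputs the expert judges to be in sync.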

 

Evaluation Metric

  • Dataset
  1. LRW [4]
  2. LRS2 [5]
  3. LRS3 [6]
  • SyncNet : LSE-D, LSE-C

SyncNet is a network that was introduced to determine whether the audio and video of a clip are in sync [2]. Given the mouth region of the video and the voice MFCC data, the network outputs a small distance between the audio and video embedding vectors if the sync is right, and a large distance if the sync is wrong.

 

Here, the Lip-Sync Error Distance (LSE-D) is used as the metric that measures whether the video frames and voice data are in sync.

 

 

If we apply temporal offsets between the video frames and the audio, we can compare the distance between the audio and video embedding vectors at each offset. At the moment the sync matches (where the temporal offset is 0) the distance is small, and as the offset increases the distance grows. The Lip-Sync Error Confidence (LSE-C), a kind of reliability indicator, therefore measures how clearly the video and sound have a well-synced point based on this change in distance: it is calculated as the difference between the median and the minimum of the distances.
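
Under the description above, the two metrics could be read off the SyncNet outputs roughly as in this sketch; it is my interpretation for illustration, not the official evaluation code.

```python
# Rough sketch of the two sync metrics described above.
import torch

def lse_distance(audio_emb: torch.Tensor, video_emb: torch.Tensor) -> float:
    """LSE-D: distance between the SyncNet audio and video embeddings of an
    aligned audio-video pair (lower = better sync)."""
    return torch.dist(audio_emb, video_emb).item()

def lse_confidence(offset_distances: torch.Tensor) -> float:
    """LSE-C: difference between the median and the minimum distance measured
    over a range of temporal offsets (higher = clearer sync peak)."""
    return (offset_distances.median() - offset_distances.min()).item()
```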

 

  • FID (Fréchet Inception Distance)

FID measures how close the distribution of the synthesized frames is to the distribution of real frames; the lower the FID, the better the visual quality of the generated video.

 

Results

1. Temporal Window: One of the big differences from the LipGAN baseline is that Wav2Lip uses multiple frames as its input. In experiments where the number of input frames was increased, both LSE-D and LSE-C improved as the temporal window grew.

 

2. Pre-trained Discriminator: Using the pre-trained expert network, which professionally checks only lip synchronization, Wav2Lip outperformed the existing Speech2Vid [3] and LipGAN models on both LSE-D and LSE-C (see the Wav2Lip (ours) rows).

 

 

3. Visual Quality Discriminator: Unlike LipGAN, adding a discriminator that judges only whether the images look real slightly reduces performance on LSE-D and LSE-C, but in terms of FID the visual quality is much better, so the lip movements look far more realistic. This variant also received much higher preference and user experience scores (see the Wav2Lip + GAN (ours) rows).

 

 

Conclusion and Opinion

Wav2Lip is a network that synthesizes much more accurate lip sync videos than previous models. It was impressive that the authors did not stop at using a discriminator to remove visual artifacts, but further boosted performance with an external, pre-trained expert discriminator for much better synchronization. In addition, various metrics and datasets were provided for performance evaluation, and the results were backed by preference scores from user studies, adding objectivity and reliability. In the near future, motion elements such as gestures and head pose will be added, and much of that research is already under way. Deep learning-based lip sync models are expected to keep improving and reach people as an ever richer service.

 

Reference

[1] Towards Automatic Face-to-Face Translation

[2] Out of time: automated lip sync in the wild

[3] Adaptive subgradient methods for online learning and stochastic optimization

[4] Lip reading in the wild

[5] Deep Audio-Visual Speech Recognition

[6] LRS3-TED: a large-scale dataset for visual speech recognition

[7] U-Net: Convolutional Networks for Biomedical Image


[Deep.In. Interview] Software Developer : Gabriel Borges

https://www.youtube.com/watch?v=sOltmyTjNEY&t=72s

Let me introduce you to the honest stories of Deep.In. at DeepBrain AI.
*DeepBrain AI cherishes the colleagues who work together. Deep.In. is the nickname for DeepBrain AI employees, chosen to appreciate the strong fellowship between them.

English name : Deep.In.
International name : Deep.人.

 

Q. Please introduce yourself.
My name is Gabriel Borges, and I am a Brazilian developer working here in DeepBrain AI.

Q. What kind of work do you do in the Dev. team?
Here we make solutions that connect humans and AI technology. I work here making web apps that use our technology, and I try to create the best possible experience for our customers so that they can use our products easily and efficiently. I'm also involved in all stages of development, which makes this job very dynamic and interesting.

 

Q. What efforts do you need to work for the DeepBrain AI development team?
You need to study a lot to keep yourself updated with all of the current technologies. So I have definitely learned a lot since I started in here.
And you know, if you want to work in here, it’s very important that you like learning new things.

Q. Tell me the advantages and disadvantages you felt while working in the development team.
The advantages are learning a lot and improving your career, so I feel that if that’s your goal, then this is a great opportunity.
But also the rhythm of working here is quite intense and there can be also a lot of pressure, so it’s important to also learn how to deal with those things.

 

Q. What do you think of Deepbrain AI?
DeepBrain AI is a cutting-edge technology company, and I have definitely learned a lot here and grown professionally.
Also, the people here are great and very friendly, which makes the work easier because they give you a lot of support.

Q. What do you usually do during breaks at work?
We have a great-looking rooftop in here, so whenever I can go there and enjoy the scenery or talk to my coworkers, it’s always a great time. So I definitely like hanging out with people there a lot.

 

Q. Please say a word to the applicants who want to join the development team.
You’re definitely going to be very welcome in here. The work in here can be challenging, for sure, but it can also be dynamic and fun. I will definitely be looking forward to working with you if you join us.

Q. What is the prospect of the DeepBrain AI development team?
Even though I started here not long ago, I have already seen the company expand a lot, and I believe it will continue to do so in the near future. There will be more employees and the structure will keep growing. That will change the way we work and our work structure, but I also feel DeepBrain AI will have a much more international influence in the near future. So that's what I see for DeepBrain AI.


[Deep.In. Interview] I sell the future of AI Human.

https://www.youtube.com/watch?v=JsbWbjpRMbY&t=20s

Let me introduce you to the honest stories of Deep.In. at DeepBrain AI.
*DeepBrain AI cherishes the colleagues who work together. Deep.In. is the nickname for DeepBrain AI employees, chosen to appreciate the strong fellowship between them.

English name : Deep.In.
International name : Deep.人.

 

Q. Please introduce yourself.
Hello, my name is Aibek and I'm a global business development manager here at DeepBrain AI. I'm originally from the Kyrgyz Republic, but I moved to the US and have spent most of my adult life living in the US and South Korea.

Q. What kind of work do you do in the Biz Dev. team?
Our global team's primary job is lead generation and outreach, bridging the knowledge gap that our potential clients and partners have regarding AI, AI Humans, and their applicability. I'm responsible for account management of enterprise clients and partners, global business development, and networking and sales in strategic markets such as the US. I also occasionally pitch at various award competitions, which is a lot of fun.

 

Q. How did you get to join DeepBrain AI?
I have been working in the startup ecosystem for several years now, primarily in IT, in areas such as smart tech and blockchain. DeepBrain AI's unique technology and know-how caught my attention, and I wanted to be part of it.

Q. What skills should you have to work for a global sales team?
Our global team is very diverse; we come from various backgrounds and speak several languages, but we have one thing in common: a passion for technology, a passion for global business, and a global mindset.

 

Q. What do you think of DeepBrain AI?
I'm satisfied with the fact that our welfare system is constantly improving, as our senior leadership always asks for our feedback and for ways to improve or add employee benefits.

Q. Which welfare system are you most satisfied with?
I particularly enjoy Family Day and the culture club activities that promote work-life balance, as well as social communication between teams.

 

Q. When did you feel rewarded while working?
As any business developer would say, I feel most rewarded when a deal goes through, when we get that partnership, and when the whole team can benefit from it.

Q. Please say a word to the applicants who want to join the global development team.
If you have passion for innovative technologies and you don’t shy away from challenges this is the place for you.

 


DeepBrain AI draws attention with AI Human technology at CES!

https://www.youtube.com/watch?v=3KOzvEglic4

DeepBrain AI participated in CES 2022 in Las Vegas and successfully operated an on-site booth for three days, from January 5th to 7th.

 

CES 2022, the world's largest consumer electronics exhibition, is where world-class innovative companies show off their technologies and new products. This year's CES was even more meaningful as it was the first offline exhibition since COVID-19. DeepBrain AI participated in CES 2022 and introduced its AI human technology from an independent booth, rather than the K-Startup pavilion operated by the Ministry of SMEs and Startups.

 

In particular, the AI Human modeled on anchor Moon Connyoung, an exclusive announcer of Arirang TV, which recently signed an MOU with DeepBrain AI, made its debut at CES 2022.

 

DeepBrain AI also introduced 'AI STUDIOS', a SaaS-based AI human solution. AI STUDIOS is a video synthesis and editing platform that enables quick video production, with an AI Human reading the script as users type it. Its biggest feature is that anyone can create videos featuring AI humans without any particular understanding of video and voice synthesis technology or equipment. Visitors to the booth were very impressed with the experience of producing videos with AI STUDIOS.

 

The booth drew keen attention to the AI human solution and the AI STUDIOS service, and active business discussions were held with domestic and foreign representatives from various industries, including Korean telecommunications companies, British broadcasters, French luxury brands, and universities based in the United States.

DeepBrain AI will continue to strive to find new business opportunities overseas beyond Korea to further expand the growth of AI Human Technology.

 


DeepBrain AI's successful NRF 2022 debut with 'AI Kiosks'.

https://www.youtube.com/watch?v=bOR3-e-kWUo

DeepBrain AI introduced its 'AI Kiosk' products, embedded with the AI Human solution built on its core AI video synthesis technology, at 'NRF 2022', the world's largest retail trade show, held in New York, USA from the 16th to the 18th.

 

'NRF 2022' is an exhibition held every January by the National Retail Federation, the world's largest retail trade association. Now in its 10th year since 2013, it is a large-scale event where you can see the trends of the global retail industry in one place, enough to be called the CES of the retail industry.

 

DeepBrain AI, participating in NRF for the first time this year, attracted many visitors by showcasing its AI Clerk, which is capable of real-time conversation just like a real store clerk. "After the pandemic ends, there will probably be some dramatic changes in the commerce industry," said Eric Jang, CEO of DeepBrain AI. "I believe they could come from AI, and these innovations will affect various fields like retail, e-commerce, and live commerce," he added.

 

DeepBrain AI presented their AI Human on the large screen at the Bird Tower in Times Square, New York.

Read More >>