Are Virtual Humans the Next Evolution of Chatbots & Voice Assistants?

We sit down with Joe Murphy, who leads US business development for DeepBrain AI, to explore the business potential of virtual humans, examine the technologies that enable them...

They go by many names, but AI-generated virtual agents that appear to talk, move, and gesture like real people seem poised to take off in the coming years. Recent fundraising milestones indicate that this space is heating up and that investors see significant applications for this technology across both the real world and the emerging VR and AR landscape.

To dive deeper into this space, I recently sat down with Joe Murphy, who leads US business development for DeepBrain AI, to explore the business potential of virtual humans, examine the technologies that enable them, and discuss the many challenges ahead as this approach to human-machine interaction matures and scales.

 

1. Joe, thanks for your time. First, can you help orient us as to what an “AI human” is, and what are some killer use cases? 

Hi Eric, thanks for the opportunity to share. For this conversation, I will use the terms "AI Human," "Virtual Human," and "Digital Twin" interchangeably. In general, they are variations on a common theme: a digital representation of a person that looks, sounds, and acts like the actual person.

On the topic of "killer use cases," I believe that AI Humans represent the logical evolution of virtual assistants and chatbots, so augmenting any chatbot with an AI Human is the killer use case. What I mean is that most existing and future chatbot use cases can be enhanced with a Virtual Human. For example, when speaking to Alexa or Siri, why do I receive a disembodied voice response from a black box? That response is then followed by some awkward conversational turn-taking, haphazardly guided by flashing lights and pulsing icons.

Previously, technical limitations made synthetic faces a case study in the uncanny valley, so the disembodied voice assistant made sense. More recently, video synthesis technology has progressed to the point where Virtual Humans can be indistinguishable from actual humans, so we are no longer constrained to having a conversation with a faceless black box.

Not to sound overly enthusiastic, but I compare the upcoming Virtual Human chatbot evolution to several other technology shifts where video was preferred and overtook the audio-only solution.

 

Entertainment: Radio → TV

Communication: Phone Call → FaceTime Call

Business: Conference Bridge → Zoom Meeting

 

Each of the paradigms above was noticeably improved with the addition of video. Adding human-centric video almost always creates a more enjoyable and natural interaction. So, we fully expect that adding AI Humans to chatbots will follow this same pattern of acceptance and adoption.

 

Read More >>


DeepBrain AI Named Among the Top 20 Korean AI Startups to Watch – Best of 2022

Startups in Korea are looking to incorporate AI technology into their existing products or services, in part because many Korean investors are keen to invest in Korean AI startups. The Korean government has said it will invest over $330 million in the development of processing-in-memory (PIM) chips, in addition to its pledge of $830 million for AI semiconductors from 2020 to 2029.

The Korean government also built six new AI schools in 2020 and 2021 to educate Korean engineers in AI technology, and corporations are looking to get into AI as well: Samsung, LG, Naver, Kakao, and Hyundai have all shown great interest in investing in AI technologies. It is clear that the Korean government, investors, and companies alike consider AI an important technology for Korea.

 

Here are the top 20 AI startups in Korea to keep an eye on for 2022

 

1. DeepBrain AI

DeepBrain AI is a Korean AI startup that researches and develops conversational AI technology. They offer a wide range of AI-powered customer service products, but their specialty is synthetic humans that can respond to natural-language questions. They use AI technology to offer video synthesis, speech synthesis, and chatbot solutions to enterprise customers such as MBN, Metro News, LG, and KB Kookmin Bank, and at CES 2022 they showcased their AI Human-embedded "AI Kiosks," which leverage the power of AI through human-based AI avatars.

 

AI Human Avatars

To create AI Human avatars, DeepBrain AI first captures video of a human model in a studio and then uses AI to analyze the model's lip, mouth, and head movements. This makes the technology well suited for creating virtual AI bankers, teachers, and even news anchors. Their revenue in 2021 was $5 million, driven by a rise in demand during COVID. Moreover, DeepBrain AI has raised over $50 million to date, led by Korea Development Bank, and the startup is currently valued at $180 million.

“There is only a very limited number of companies both locally and globally with high-quality deep learning-based voice synthesis technology. We will achieve excellent performances in diversified business domains with our competitive technology, which is at par with those of global companies,” said CEO of DeepBrain AI, Eric Jang.

 

Read More >>

 

 


DeepBrain AI Named Among the Top Ten Finalists of the T-Challenge

We are very happy to announce the TOP 10 teams of our T-Challenge on XR experiences. The ideas focus on virtual shops, product and service experiences, and avatars, as well as technologies and platforms that improve the XR experience.

The T-Challenge, organized by T-Labs and T-Mobile U.S., enables teams from all over the world to show and discuss their ideas for future #XR customer experiences.

The finalists are now entering the second phase of the competition, with the goal of developing their ideas further by May 22nd. In the "Solution Development" stream, solutions should be presented as a Minimum Viable Product (MVP); in "Concept and Design Creation," a tangible prototype is required.

Besides prize money of up to €150,000, the teams have the chance to see their solutions integrated into selected Telekom shops in Germany, the rest of Europe, and the United States.

 

Conversational, human-like AI avatars

DeepBrain AI creates digital twin avatars that share the same appearance, voice, accent, and movements as their real counterparts. With this video synthesis technology, you can easily create videos by typing text or speak to digital twins in real time, enabling enterprises to engage their customers in a more efficient and more human way.

Read More >>


Virtual Humans: The Video Evolution for Metaverse-Bound Voice Assistants, Brand Ambassadors, and Media Personalities

Text-To-Speech (TTS) is the tech-du-jour for most voice assistants. It makes no difference if someone interacts with Alexa, Siri, Google, or others; the responses are typically TTS audio playing out of a smart speaker, mobile phone, or automobile speaker. The current voice assistant paradigm of speaking to a black box and receiving a disembodied voice response works with the interaction models of today, but this doesn’t translate well to the metaverse we see on the horizon.

Enter a host of new start-up companies all in a race to develop “Virtual Humans” or “Digital Twins.” They are creating what will most likely be the next generation of conversational interfaces based on more natural, authentic, and humanistic digital interactions. So why Virtual Humans, and why now? A few technology drivers and socioeconomic factors have created the perfect storm for real-time video synthesis and Virtual Humans.

 

https://youtu.be/ftOqULabRQM

TECHNOLOGY DRIVERS
Compared to conversational TTS responses, video synthesis solutions undoubtedly require higher workloads (CPU+GPU) to generate video and higher payloads (file sizes) to deliver it. However, ever-increasing CPU and GPU performance and availability speed up video synthesis in the cloud and at the edge, and advances in batch processing and smart caching have enabled real-time video synthesis that rivals TTS solutions for conversational speed. So, the bottleneck of generating ultra-realistic video on the fly has largely been addressed. That leaves delivering video in real time, which, thanks to broadband speeds over both Wi-Fi and 5G, is now readily available to most homes, businesses, and schools. You can see the comparison in the video below.
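To make the caching point concrete, here is a minimal, hypothetical sketch of response caching for synthesized clips: repeated prompts (greetings, FAQs, stock phrases) are served from memory instead of being re-rendered on the GPU. The `synthesize_video` callable is a placeholder, not an actual DeepBrain AI function.

```python
import hashlib

# Hypothetical sketch: cache synthesized clips so repeated prompts
# (greetings, FAQs, stock phrases) skip the expensive GPU render step.
# `synthesize_video` is a placeholder for whatever rendering backend is used.
_clip_cache: dict[str, bytes] = {}

def get_clip(script: str, avatar_id: str, synthesize_video) -> bytes:
    key = hashlib.sha256(f"{avatar_id}:{script}".encode()).hexdigest()
    if key not in _clip_cache:
        _clip_cache[key] = synthesize_video(script=script, avatar=avatar_id)
    return _clip_cache[key]
```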


HELP (AND CONTENT) WANTED
Businesses that require employees to engage with customers, such as hotels, banks, and quick-service restaurants, are having trouble hiring and retaining employees. A lack of available, qualified employees can damage the customer's perception of the brand and create a real drain on revenue. Enter Virtual Humans, which can handle basic requests quickly and consistently. In Korea, both 7-11 and KB Bank have installed AI Kiosks that rely on a Virtual Human to interact with customers; the 7-11 implementation supports fully unstaffed operation.

Another promising vertical for Virtual Humans is media, both broadcast media and social media (influencers). Whether streaming news 24 hours a day or staying relevant on TikTok, the need is the same: generate more video content, faster. Once again, Asia has taken the lead with Virtual Humans. Television stations such as MBN and LG HelloVision supplement their live broadcasts with Virtual Human versions of their lead anchors, which provide regular news updates throughout the day. Using either API calls or an intuitive "what you type is what you get" web interface, videos with Virtual Humans can be made in minutes without a camera, crew, lights, make-up, and so on. It is a time-saving and cost-saving tool that can be intermixed throughout the day to keep content fresh.
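As an illustration of the API-driven workflow described above, the sketch below shows what a request for a Virtual Human video might look like. The endpoint, field names, and response key are invented placeholders for illustration only; they are not DeepBrain AI's documented API.

```python
import requests

# Hypothetical request shape only: the URL, JSON fields, and response key
# are illustrative placeholders, not DeepBrain AI's documented API.
API_URL = "https://api.example.com/v1/videos"

def request_virtual_human_video(script: str, model_id: str, api_key: str) -> str:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": model_id, "script": script, "resolution": "1080p"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["video_url"]  # assumed response field
```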

“What is our strategy for the Metaverse?” That question is being asked in conference rooms across all sectors. It is easy to imagine how brands leveraging today's 2D Virtual Humans for taking orders, helping customers, and sharing news will quickly evolve into the early pioneers of the 3D world and the metaverse. Watch throughout the year for some big announcements in this space.

 

Read More >>


DeepBrain AI Participates in 'NVIDIA GTC 2022', Presenting AI Human Technology and Research Outcomes

▶ DeepBrain AI CTO Kyung-Soo Chae presents time-reducing lip-sync video synthesis technology.
▶ Participates in the "Digital Human and Interactive AI" session panel to introduce the current status of the AI Human industry.

 

DeepBrain AI, an artificial intelligence (AI) company, announced on the 29th that it participated in "NVIDIA GTC 2022," the world's largest AI developer conference, where it presented AI human-based research results and its overall technology.

NVIDIA GTC (GPU Technology Conference) is a global technology conference organized by NVIDIA, a leader in AI computing technology, and was held as an online live event from March 21 to 24. Companies from all over the world took part in more than 500 live sessions covering deep learning, data science, high-performance computing, and robotics, showcasing technology and discussing related agendas with participants.

 

DeepBrain AI Chief Technology Officer (CTO) Kyung-Soo Chae gave a session presentation under the theme "Toward Real-Time Audiovisual Conversation with Artificial Humans," on enabling real-time video- and voice-based communication.

CTO Kyung-Soo Chae introduced DeepBrain AI's lip-sync video synthesis technology, which enables real-time conversations between AI humans and people. In addition to technology that generates high-resolution lip-sync video at four times real-time speed through its own artificial neural network design, he presented research results showing that synthesis time was further reduced to one third by applying NVIDIA's deep learning inference optimization SDK.

 

In addition, DeepBrain AI participated as a panelist in the "Digital Human and Convergent AI" session to present its AI human technology and current business status. Through the real-time Q&A that followed, the company communicated actively with industry officials and introduced DeepBrain AI's AI human use cases and future business plans.

 

DeepBrain AI CTO Kyung-Soo Chae said, "The area that takes the most time and money when implementing AI humans is video and voice synthesis, and with DeepBrain AI's unique technology we have announced results that reduce high-quality video synthesis time to 1/12 of real time."

 

Read More >>


Introducing AI Yoon Seok-Yeoul, appearing on the campaign trail for the 20th president of South Korea.

The PPP (People Power Party) introduced AI Yoon Seok-Yeoul as part of its campaign strategy for the 20th presidential election.
DeepBrain AI developed AI Yoon Seok-Yeoul, which the PPP then used to communicate actively with voters nationwide.

 

"AI Yoon Seok-Yeoul" first appeared at the inauguration ceremony of the presidential election, and said, "The first AI in politics symbolizes the new future and challenge of Korea," adding, "We will be visiting you all over the country."
During the actual election campaign, the PPP utilized AI Yoon Seok-Yeoul to produce personalized targets for each region into personalized videos and distributed them to party members. Which was then mounted on the large screen of the campaign car to explain local pledges.

 

Looking at the technological advantages of AI Yoon Seok-Yeoul:

First, there are no location or time constraints.
Unlike a person, AI Yoon Seok-Yeoul can appear in any part of the country at any given time.
This allowed the real candidate, Yoon Seok-Yeoul, to focus on campaigning in more strategically important areas.
It also served as an alternative for contact-free communication during COVID-19.

Second, election strategies customized for each region.
Set against the familiar landmarks of each region, neighborhood, city, and provincial pledges were delivered by AI Yoon Seok-Yeoul.
A customized pledge video could be made just by typing text and delivered to voters through text messages, social media, YouTube, and campaign vehicle screens.

Third, reduced promotion costs.
Featuring a real person in a customized pledge video for each region would have required costly production resources such as studios, filming equipment, and staff, and having the candidate travel to every region in person would have cost precious time and money.
AI Yoon Seok-Yeoul was produced in one to two weeks and, after that, significantly reduced the cost of promoting the campaign.

 

The technology is expected to be used in the upcoming nationwide local elections for heads of local government.
Furthermore, we look forward to the day when AI presidents and AI mayors play a role in listening to and resolving complaints from the public.

Read More >>


DeepBrain Brings AI to Life at the NAB Show.

Visit DeepBrain AI to witness the future of Digital Twins for broadcast and social media.

 

 

 

Are you ready for your very own Digital Twin?

News stations across Asia are already using DeepBrain AI’s technology for creating “AI Anchors.” DeepBrain AI representatives will be on-site to share information on the process: from the video shoot to deep learning to delivery. Our approach creates an ultra-realistic AI model that matches the actual person’s appearance, voice, gestures, and mannerisms.

 

 

 

 

 

 

AI Studios

DeepBrain AI’s software platform makes creating videos as easy as typing a script.
AI Studios eliminates the investment in high-cost resources required by traditional video production. No cameras, no lights, no crew: type your script, and our video synthesis technology generates the video in seconds.

 

 

 

 

 

Conversational AI humans

In addition to presenter-based or informational videos, DeepBrain AI’s video synthesis can also be paired with chatbot software to create a fully interactive, conversational experience.

 

 

Booth Information.

 

Find us at NAB Show 2022

Booth number: C2627 in Central Hall
Check or click the floor map below to navigate your way to our booth.

 

DeepBrain AI will be exhibiting at #NAB at the Las Vegas Convention Center, April 24–27.
We will be demonstrating conversational AI Humans and our automatic video generator, AI Studios, on site. Come visit us at our booth to learn more about how you can benefit from AI Human technology.
To arrange a meeting, contact Ahmad Elzahdan (teamusa@deepbrainai.io).

 

 

 

 

 

More Details on #ASUGSV Summit.


Meet the DeepBrain AI Team at the ASU+GSV Summit.

We are inviting you to witness the future of AI Human-powered online and offline education and training.

 

 

 

Create your own AI Human

DeepBrain AI creates a digital twin of a real person by utilizing deep learning and video synthesis technology.
Your AI version will look and speak just like you, becoming a true digital double.

 

 

 

 

 

 

 

 

 

Conversational AI Humans

Speak, ask, and engage with AI Humans in real time.
Enterprises can deploy this AI solution for more personalized customer engagement, as if it were a 1-on-1 in-person consultation available 24/7.

 

 

 

 

 

 

 

 

 

 

AI Studios

AI Studios eliminates the need for investment in the high-cost resources required by traditional video production.
Just by typing a script, AI Humans can speak naturally and use body language and gestures just like real presenters.

 

 

 

 

 

 

 

 

 

Find us at ASU+GSV!

Join us in Booth #213A

 

 

DeepBrain AI will be exhibiting at #ASUGSV in San Diego, April 4–6.
We will be demonstrating conversational AI Humans and our automatic video generator, AI Studios, on site. Come visit us at our booth to learn more about how you can benefit from AI Human technology.
To arrange a meeting, contact  (teamusa@deepbrainai.io).

 

 

 

More News >>

 

More Details on #NAB


[Deep.In. Article] AdaSpeech2: Adaptive Text to Speech with Untranscribed Data

Deep Learning Team: Colin

Abstract

Like the AdaSpeech model we looked at last time, existing TTS adaptation methods use paired text-speech data to synthesize the voice of a specific speaker. However, since it is practically difficult to prepare paired data, it would be much more efficient to adapt the TTS model using only untranscribed speech data. The most accessible approach is to use an automatic speech recognition (ASR) system for transcription, but ASR is difficult to apply in certain situations and its recognition accuracy is not always high enough, which can reduce final adaptation performance. There have also been attempts to solve this problem by jointly training the TTS pipeline and the adaptation module, but these have the disadvantage that they cannot easily be combined with other commercial TTS models.

AdaSpeech2 designs an additional module that can be combined with any TTS model to enable learning from untranscribed speech (pluggable), and from this it proposes a model that can produce results equivalent to a TTS model fully adapted with paired text-speech data (effective).

Summary for Busy People

  • Additional modules are attached to the AdaSpeech structure to enable adaptation to specific speakers using only speech data.
  • The mel encoder's latent space is trained to be similar to the phoneme encoder's latent space, so the mel decoder receives the same features regardless of whether the input is text or speech. This is suitable for situations where only speech data can be fed into the pre-trained TTS model.
  • AdaSpeech2's adaptation method can be attached to any TTS model and produces performance similar to models adapted to particular speakers with paired text-speech data.

Model Structure

AdaSpeech2 uses AdaSpeech, which consists of a phoneme encoder and a mel-spectrogram decoder, as its backbone model. Acoustic condition modeling and conditional layer normalization are used as in the original AdaSpeech, but are omitted from the figure for simplicity. On top of this, a mel-spectrogram encoder is added that receives and encodes speech data, and an L2 loss is applied to make its output similar to that of the phoneme encoder. The detailed training process is explained below.
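As a rough illustration of this structure, here is a minimal, simplified sketch (not the authors' implementation): a toy phoneme encoder and mel decoder stand in for the source AdaSpeech model, and the added mel-spectrogram encoder is pulled toward the phoneme encoder's latent space with an L2 loss.

```python
import torch
import torch.nn as nn

# Simplified sketch of the layout described above (not the authors' code):
# the phoneme encoder and mel decoder form the source TTS model, and the
# added mel-spectrogram encoder is pulled toward the phoneme encoder's
# latent space with an L2 loss.
class AdaSpeech2Sketch(nn.Module):
    def __init__(self, n_phonemes=100, n_mels=80, d_model=256):
        super().__init__()
        self.phoneme_encoder = nn.Sequential(          # source model
            nn.Embedding(n_phonemes, d_model), nn.Linear(d_model, d_model)
        )
        self.mel_decoder = nn.Sequential(              # source model
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_mels)
        )
        self.mel_encoder = nn.Sequential(              # added, pluggable
            nn.Linear(n_mels, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def alignment_loss(self, phonemes, mels):
        # L2 distance between the text-side and speech-side latent sequences
        # (assumed to be length-aligned via phoneme durations).
        h_text = self.phoneme_encoder(phonemes)   # (B, T, d_model)
        h_speech = self.mel_encoder(mels)         # (B, T, d_model)
        return torch.mean((h_text - h_speech) ** 2)
```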

Training and Inference Process

Step 1. Source Model Training

First of all, it is important to train the source TTS model well. Train the phoneme encoder and mel-spectrogram decoder of the AdaSpeech model with a sufficient amount of paired text-speech data, where the duration information used to expand the phoneme encoder's output to the length of the mel-spectrogram is obtained with the Montreal Forced Aligner (MFA).

Step 2. Mel Encoder Alignment

Once you have a well-trained source model, attach a mel-spectrogram encoder for untranscribed speech adaptation. This encoder ultimately produces the features that feed the mel-spectrogram decoder while auto-encoding the speech, and its latent space needs to match that of the phoneme encoder because it must emit the same output as the features derived from the transcription (text). So, while running TTS training again with paired text-speech data, we compute and minimize the L2 loss between the sequence from the phoneme encoder and the sequence from the mel-spectrogram encoder, aligning the two latent spaces. This method can be described as pluggable because it does not retrain the entire structure; it freezes the parameters of the source model and updates only the parameters of the mel-spectrogram encoder.
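Continuing the sketch above, the alignment step might look roughly like this: the source model (phoneme encoder and mel decoder) is frozen and only the pluggable mel encoder is updated. The TTS reconstruction term of the full objective is omitted for brevity.

```python
import torch

# Sketch of the alignment step: freeze the source model and update only
# the pluggable mel-spectrogram encoder.
model = AdaSpeech2Sketch()
for module in (model.phoneme_encoder, model.mel_decoder):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(model.mel_encoder.parameters(), lr=1e-4)

def alignment_step(phonemes, mels):
    loss = model.alignment_loss(phonemes, mels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```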

Step 3. Untranscribed Speech Adaptation

Now fine-tune the model using only the (untranscribed) speech data of the specific speaker you want to synthesize. Since the input speech is synthesized back into speech via the mel-spectrogram encoder and mel-spectrogram decoder, this is speech restoration through auto-encoding; the source model updates only the conditional layer normalization of the mel-spectrogram decoder, which minimizes computation.
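A hedged sketch of this selective fine-tuning, assuming the conditional layer-normalization modules carry "cond_layer_norm" in their parameter names (an assumption about naming, not a detail taken from the paper):

```python
# Sketch of speaker adaptation with untranscribed speech: only the
# decoder's conditional layer-normalization parameters are trainable.
# The "cond_layer_norm" name filter is an assumed naming convention.
def freeze_except_cond_layer_norm(tts_model):
    trainable = []
    for name, param in tts_model.named_parameters():
        param.requires_grad = "cond_layer_norm" in name
        if param.requires_grad:
            trainable.append(param)
    return trainable  # hand these to the adaptation optimizer

# The adaptation objective is then auto-encoding: minimize the error of
# mel_decoder(mel_encoder(mels)) against the input mels.
```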

Step 4. Inference

Once all of the above adaptation steps are complete, the model can mimic the voice of the particular speaker whenever text is entered, using the phoneme encoder (which was not fine-tuned) and the partially fine-tuned mel-spectrogram decoder.
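In terms of the earlier sketch, inference then reduces to running text through the phoneme encoder and decoding, with the mel encoder no longer involved:

```python
import torch

# Inference with the adapted sketch model: text alone drives synthesis;
# the mel-spectrogram encoder is no longer needed. The resulting
# mel-spectrogram would then be passed to a vocoder to obtain audio.
def synthesize(model, phonemes):
    with torch.no_grad():
        h_text = model.phoneme_encoder(phonemes)   # not fine-tuned
        return model.mel_decoder(h_text)           # partially fine-tuned
```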

Experiment Results

Adaptation Voice Quality

In Table 1, joint training, in which the phoneme encoder and mel-spectrogram encoder are learned at the same time, is the baseline setting for this experiment, and the strategy of training the phoneme encoder and then the mel-spectrogram encoder in sequence is judged to be superior.

In addition, the performance of the AdaSpeech and PPG-based models used as backbones was considered the upper limit for AdaSpeech2's performance, so an experiment was conducted to compare them together. From the MOS and SMOS results, we can see that AdaSpeech2 synthesizes voices of almost the same quality as the models considered upper limits.

 

Analyses on Adaptation Strategy

 

An ablation study was conducted to evaluate whether the strategies mentioned earlier in the training process contributed to the model's performance. The results show that voice quality deteriorates if the L2 loss between the outputs of the phoneme encoder and the mel-spectrogram encoder is removed, or if the mel-spectrogram encoder is also updated during the fine-tuning step.

 

Varying Adaptation Data

When the number of adaptation speech samples is less than 20, synthesis quality improves significantly as the amount of data increases; beyond that, there is no significant further improvement in quality.

 

Conclusion and Opinion

Machine learning engineers who train TTS models know that data quality determines synthesis quality, so they put a lot of effort into collecting and preprocessing data. Normally, to synthesize the voice of a new speaker, that speaker's speech files and transcribed text are collected in pairs and the TTS model is retrained from scratch; with the AdaSpeech2 method, only the speech data needs to be collected and the model fine-tuned. Another advantage is that it is easy to apply in practice because it can be combined with any TTS model.

For further research on AdaSpeech2, it could be interesting to observe how performance changes when other distance functions, such as cosine similarity, are used as the constraint instead of the L2 loss.
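For reference, the two constraints differ only in the distance function applied to the aligned latent sequences; a sketch of both options follows (the cosine variant is the hypothetical follow-up, not something evaluated in the paper):

```python
import torch
import torch.nn.functional as F

# The paper's L2 alignment constraint and a hypothetical cosine-similarity
# alternative, applied to length-aligned latent sequences (B, T, d_model).
def l2_alignment(h_text, h_speech):
    return torch.mean((h_text - h_speech) ** 2)

def cosine_alignment(h_text, h_speech):
    # 1 - cosine similarity per frame, averaged over batch and time.
    return torch.mean(1.0 - F.cosine_similarity(h_text, h_speech, dim=-1))
```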

Next time, we will introduce the last paper in the AdaSpeech series.

Reference

(1) [AdaSpeech2 paper] AdaSpeech 2: Adaptive text to speech with untranscribed data

(2) [AdaSpeech2 demo] https://speechresearch.github.io/adaspeech2/

 


From models to presidential candidates… DeepBrain AI is expected to lead the technology of AI humans to the next level in a new era for AI

https://www.youtube.com/watch?v=6XgAu-m7xQ8&feature=emb_title

DeepBrain AI has successfully commercialized the world's first real-time video synthesis solution in the form of both conversational and non-conversational 'AI Human' technologies and applied them to various industries. Recently, SBS, a leading South Korean national broadcaster, introduced many different types of virtual humans, including DeepBrain AI's AI humans.

 

SBS reports that AI-powered humans modeled after real people are currently being applied across industries as AI bankers, AI shop assistants, lecturers, and more.

 

In the news clip, the AI presidential candidate Yoon Seok-yeol, 'Wiki Yoon', produced by DeepBrain AI, was highlighted. The report noted that AI presidential candidates have emerged as a new election campaign strategy to overcome the limitations of non-face-to-face campaigning due to COVID-19, and that AI Yoon is hugely popular among young South Koreans.

 

DeepBrain AI’s ‘Detect Deepfake AI’ (detectdeepfake.ai) was also mentioned. Detect Deepfake AI is a service that verifies the authenticity of a video through deep learning analysis of videos suspected of forgery or alteration. DeepBrain AI launched the service and distributed it for free, hoping to minimize the damage caused by deepfake videos.

 

Check out the video news clip above to learn more about AI humans and DeepBrain AI’s services and technology.