Mill member profiles

Ankush Bikkasani: Deep Word

Jun 8, 2021

When it comes to synthetic video, seeing is believing. Synthetic video pairs any face with any words, in any language, in any voice. It’s created not by a person filming an actor on a set, but by artificial intelligence (AI), in cyberspace.

If you’re visualizing something like bad dubbing on a vintage samurai movie, stop. Synthetic video looks astonishingly real. Take a look at this demo from Deep Word, the company founded by CEO Ankush Bikkasani.

Mill member Ankush, a senior at the Indiana University Kelley School of Business, launched Deep Word in November 2020. Since then, Deep Word has already racked up 19,000 users, who have generated 38,000 synthetic videos.

Ankush, who first got his hands on his father’s video camera at age six, is deeply aware of the pain points associated with traditional filming. As a freelance videographer, he experienced them firsthand.

[Photo: headshot of Ankush Bikkasani]

“Going to a location, setting up a camera, setting up lighting equipment—all the equipment in the exact same spot—having a person sit in front of the camera, talking for an hour, doing multiple takes when people are worried about their hair or what they look like . . . It was a very repetitive process.”

“So I started to read into deep fakes and how they worked, how they were generated, the processing power necessary to create them, how good they looked. It was really intriguing, but unfortunately, the workflow that we needed just wasn’t there with deep fake software.”

“Deep fakes are essentially a very high-quality face swap: putting your face on mine, or my face on yours. But what we needed was the ability to change or modify the things that somebody was saying.”

“That’s when I started talking to some of my friends at Indiana University who are getting their masters in data science, and we started working on some prototypes. Some very interesting research came out last summer. We started putting that together with some other research and things that we’d found, and eventually, we came up with the prototype that’s deployed to our website currently.”

Here’s how it works: users can select one of Deep Word’s video actors or upload their own video, and then supply either a script or an audio file. Within minutes, Deep Word’s technology generates a seamless match between the original video and its new audio. Need an English version and a Spanish version? No problem. Just send the script through Google Translate or another service, paste it into their software, and create a second version in Spanish. It’s incredibly cost-efficient and fast, with countless applications.
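The workflow above can be pictured as a small request object: one video source plus one speech source. The sketch below is illustrative only. `VideoJob`, its field names, and `language_versions` are hypothetical and not part of Deep Word's product, and the translated scripts are assumed to come from whatever translation service the user prefers.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class VideoJob:
    """Hypothetical request: exactly one video source and one speech source."""
    actor_id: Optional[str] = None        # a stock video actor, or...
    uploaded_video: Optional[str] = None  # ...a path to the user's own video
    script: Optional[str] = None          # text to be spoken, or...
    audio_file: Optional[str] = None      # ...a path to pre-recorded audio

    def __post_init__(self) -> None:
        # Reject jobs that set both or neither option in each pair.
        if (self.actor_id is None) == (self.uploaded_video is None):
            raise ValueError("choose exactly one of actor_id or uploaded_video")
        if (self.script is None) == (self.audio_file is None):
            raise ValueError("supply exactly one of script or audio_file")


def language_versions(job: VideoJob, translations: Dict[str, str]) -> Dict[str, VideoJob]:
    """Return one job per language, reusing the same video source."""
    if job.script is None:
        raise ValueError("language versions need a script-based job")
    return {lang: replace(job, script=text) for lang, text in translations.items()}
```

An English job plus `{"es": "..."}` yields a second job that differs only in its script, which mirrors the "English version and Spanish version" case described above.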

There are a few limitations: the technology works best with a relatively stationary actor facing the camera, for example. But that’s improving quickly. Even over the last six months, Deep Word’s demo videos have added more hand gestures and movement.

“One of the largest opportunities that we see developing is in e-learning,” Ankush explains. “Lots of really good research shows that when kids, or even adults, have a face to associate with the information they’re learning, they retain almost up to 40% more information. And a lot of content being produced doesn’t have that. So it’s a very easy and cheap value proposition to these companies. We’re basically communicating, ‘Hey, if you’re producing content, or if your past content doesn’t have a teacher, just integrate with our software, and you can automatically have teachers overlaid over all of your content.’”

Deep Word’s AI will convert text to audio spoken through an artificial voice called a neural voice. Deep Word trains its neural voices on 30-40 hours of data of people talking. Eventually, they hope to offer users the ability to clone their own voices (without having to sit behind a microphone for a week).

“A neural voice is kind of a buzzword. It just means a very high-fidelity, as realistic as possible voice. They sound pretty good. We still have a ways to go with them, though.”

Deep Word also works with realtors and brokerages. “Essentially what we’re offering is a service: if you send the properties, images, and listing descriptions, we can create video tours of each property featuring their face and voice.” Online real estate listings with videos receive over 400% more buyer inquiries, but most listings don’t include video because of the prohibitive cost of traditional filming.

The impact of video is not just from showing the interior space in a more dynamic way, Ankush adds. “The added personability of having someone’s face and voice, our early results show that that’s pretty powerful. It’s a very easy way for realtors to stand out from other listings online and differentiate themselves beyond just the homes that they’re selling.”

This summer, Deep Word will launch an API that allows companies to generate videos at scale, to personalize content for their individual employees or customers within a few minutes. Consider the value of this personalization for large companies using a learning management system (LMS) to train and onboard new employees. Typical LMS onboarding is dull and generic: pages and pages of information that new employees skim, at best. Strong onboarding processes, on the other hand, can improve new-hire retention by 82%.

“So if a company has a database of employees that they want to onboard, they can pass very specific information about each individual employee to our servers, and they can have video generated addressing each one of those employees, personalized for each employee.”

“For example, if I was a new employee at The Mill, and they were using our software to onboard new employees, they would know that my name is Ankush. They would know my job position, say, a marketing position. So the video would start by addressing me, addressing anything specific to The Mill that’s happening at the time, and then it would address the daily responsibilities that I have, very specific to my exact position. The power of the API allows for one-on-one personalization, at scale.”
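As a rough illustration of the personalization Ankush describes, the sketch below turns employee records into one script per person. It does not call Deep Word's API, which the article does not document. The template wording, the record fields (`id`, `name`, `position`, `responsibilities`), and `build_scripts` are all assumptions made for the example.

```python
from string import Template
from typing import Dict, List

# Hypothetical onboarding script; each $placeholder is filled per employee.
ONBOARDING_TEMPLATE = Template(
    "Hi $name, welcome to $company! "
    "As our new $position, here is what your day-to-day will look like: "
    "$responsibilities"
)


def build_scripts(employees: List[Dict[str, str]], company: str) -> Dict[str, str]:
    """Return a personalized script for each employee, keyed by employee id."""
    scripts = {}
    for employee in employees:
        # substitute() raises KeyError if a record is missing a field,
        # so incomplete records fail loudly instead of producing a bad video.
        scripts[employee["id"]] = ONBOARDING_TEMPLATE.substitute(
            name=employee["name"],
            company=company,
            position=employee["position"],
            responsibilities=employee["responsibilities"],
        )
    return scripts
```

Each resulting script would then be submitted as the text for one generated video, so a database of a thousand employees becomes a thousand individually addressed videos.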

“Deep Word has really been able to prove out the technology with individual users,” said Cy Megnin, Elevate Ventures’ entrepreneur-in-residence serving Velocities, a partnership supporting startups in south-central Indiana. “What has me most excited about this company is the release of its API, which will allow video production to be truly scalable.”

Although it’s very early in the field, other deep fake software is already online and open for anyone to use. In fact, Deep Word has competitors, including Synthesia, although their technology works differently. “They are essentially puppeting faces,” Ankush explains. “Every time their software sees a new face, they have to train a model to output video with that face. Ours is a generalized model, meaning that it will work with video of anyone without further training, so it’s a much faster and more versatile process. If I wanted to integrate a thousand video actors into our website in the next hour, I could, but for them, each one would take several days of model training and integration, which is why Synthesia charges $500 to $1,000 per video actor.”

In addition to CEO Ankush, the Deep Word team includes two data scientists and a software engineer, all IU graduates. In 2020, Deep Word won a $20,000 pre-seed award in the Elevate Nexus Regional Pitch Competition. “That was huge for us,” Ankush notes. “It was a really great validation of our product right before we launched.” They also secured $100,000 in Amazon Web Services (AWS) credits. That was another important win: Deep Word processes all its videos and trains its models through AWS, and the credits saved the company 85% of its operational costs.

So far this year, Deep Word placed first at the Clapp IDEA Competition and second at the Cheng Wu Innovation Challenge, and secured an additional $20,000 investment from the Community Ideation Fund (run by Elevate Ventures through the Velocities partnership) to enable further technological improvements.

“Our ultimate goals are one, higher resolution output—having our software return completely photorealistic results—and two, real-time video production.”

“Right now, if you generate a five-minute video, it might take ten to fifteen minutes to generate. But being able to generate that instantly, in addition to hand gesture and body movement synthesis, is what we’re aiming for over the next eight months. So a much more complete video suite. It’s not just somebody talking to a camera, but now they’re moving around, demonstrating things, and it’s also happening in real-time.”

The potential to quickly and easily put words into anyone’s mouth, on video, in an incredibly realistic way raises obvious concerns about ethics, as well as business concerns about regulation. Ankush and his team have established strict ethics rules for using their product, and they’re prepared to comply with regulations if and when they are introduced.

“At the end of the day, we only want content being produced that is intended to be produced by the people who are in the video. And we take a lot of measures to really put our foot down and say that this is how it’s going to be. We monitor all the content produced through our website. We’ve developed auto flagging systems for content, and we’re working on an internal video verification tool.”

Ankush sees potential in that verification tool not only as an internal solution for Deep Word, but as a large market opportunity in itself.

“It could become the standard for verifying if a video has been produced synthetically. A video file contains metadata. There are ways that you can set up this metadata to indicate if the original video file has been tampered with, where it’s from, and if it was intended to be created in the first place.”
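One generic way to make metadata tamper-evident is to record a hash of the video and sign the record. The sketch below shows that idea only. It is not Deep Word's verification tool, and the manifest fields (`producer`, `synthetic`, `consented`) are hypothetical. It uses a shared-secret HMAC for brevity, whereas a real standard would more likely use public-key signatures embedded in the file's own metadata.

```python
import hashlib
import hmac
import json


def make_manifest(video_bytes: bytes, producer: str, consented: bool, key: bytes) -> dict:
    """Describe a synthetic video and sign the description."""
    record = {
        "sha256": hashlib.sha256(video_bytes).hexdigest(),  # fingerprint of the file
        "producer": producer,                               # where it is from
        "synthetic": True,                                  # produced synthetically
        "consented": consented,                             # intended to be created
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return record


def verify_manifest(video_bytes: bytes, manifest: dict, key: bytes) -> bool:
    """Return True only if neither the manifest nor the video was altered."""
    claimed = dict(manifest)
    signature = claimed.pop("signature", "")
    payload = json.dumps(claimed, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False  # the metadata itself was edited
    return claimed.get("sha256") == hashlib.sha256(video_bytes).hexdigest()
```

Changing a single byte of the video, or any field of the manifest, makes `verify_manifest` return `False`.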

“I think synthetic video is a hundred percent here to stay. It’s just too much of an improvement—or its potential is too much of an improvement—over how we currently produce video. And I think that regulators will understand that.”

When Ankush explains his business to people, he says, most of them don’t immediately recognize the ethics issue, until they see an actual synthetic video. “Then they’re like, ‘Oh, wait, this is pretty realistic.’ I have to explain how seriously we do take it. It’s an evolving field. It’s a very gray area. The ultimate goal is that we and other companies hold the same ethical grounds, but we can’t always guarantee the perceptions of others.”

Within that gray area, there is also positive potential. For example, Deep Word’s technology makes it easy to increase the diversity of faces, voices, and languages represented in training and educational videos—an important shift not only for effective learning, but for ethics and representation.

Currently, Deep Word is a freemium SaaS (software as a service). Anyone can sign up and create five videos per month. Users who want to create more or longer videos with higher resolution output can pay for a subscription, with pricing tiered by usage. Most current customers, especially the power users generating many videos per month, are individual creators.

“We’re working with a few enterprise companies,” Ankush says. “For example, we worked with PDQ, a Southern fast-food chicken company, and created onboarding videos for every single one of their fast-food job positions.”

“Enterprise companies have much larger video needs than most individual users on our website, even users that are paying. The API will allow them to generate all their video needs at scale. That’s a much more attractive solution to enterprises than one-by-one video generation via our website.”

Given the huge potential of the technology, the early stage of the market, and Deep Word’s fast user growth, how is Ankush handling the startup life?

“I was always drawn towards entrepreneurship, but I don’t think it’s ingrained in me in the same way that a lot of people say it is. My passion is creating things, being creative, making things, and entrepreneurship happens to lend itself pretty well to that personality.”

“Some days you wake up excited, and then you go to sleep defeated. Somehow sleep resets it. It’s been an emotional roller coaster, and I’m confident that’s how it’s going to be for the remainder of the startup. It’s very fulfilling.”

Learn more about Deep Word