Beyond the Fakery:
Using Deepfake Technologies to User-Test Products and Train Intelligent Assistants
My colleague, Mike Kuniavsky, and I have recently been thinking a lot about how companies can use deepfake technologies to dramatically accelerate the implementation of various types of virtual experiences.
Deepfakes are a dramatic use of AI to generate synthetic media, like videos or audio, that's realistic enough to fool observers. This kind of realistic representation of complex objects, including people, often depicted in motion within realistic settings, used to require the efforts of highly trained technicians and artists (think Hollywood special effects). But deepfake technology now allows almost anyone to generate similarly realistic effects.
The illusions this tech can create are at the same time exciting and worrisome. But as highlighted in this article, deepfakes themselves (which, as the name implies, are inherently deceptive) are just one potential use of the underlying capability. The core capability is the use of AI and ML to automate the creation of realistic representations. Those digital representations can come in many forms, including audio, video, and even XR experiences.
In Accenture’s Digital Experiences R&D Group, we envision this boosting our ability to digitalize and transform the product design process, the work experience, and the consumer experience. And all without any deception! We call this broader class of use cases “Automatically Synthesized Experiences” (ASEs), because they don’t depend on fakery. In the rest of this article, we’ll describe two use cases we’re excited about.
Letting people experience a design before it’s been physically made
One class of ASE use we're currently exploring involves accelerating the creation and experience-testing of virtual prototypes. The idea is to transform the product design process by allowing humans, both designers and potential customers, to experience products and facilities while they're still being designed, before even mockups or prototypes have been physically created. By automatically creating convincing, perhaps immersive ASEs as ways to virtually "pre-experience" many variations of a design, these systems help designers rapidly explore a large space of possibilities. Both designers and test users can vividly experience using a product that doesn't really exist yet.
In combination with extended-reality interfaces, simulation technologies, and automated analysis of user feedback, this will allow designers to rapidly explore possibility spaces. It also means they can create and richly test many more design variations without incurring the cost or time required to physically realize them all. In this future design process, we think the role of the human designer will shift toward that of a design lead: humans specify the general parameters and constraints for a family of designs, and automated design assistants feed those models, along with other contextual knowledge, into machine learning systems that generate proposed designs. At the same time, the assistants generate the ASEs that let potential customers experience each proposed design visually, or perhaps in an immersive VR experience. It will even be possible for users to test specific design variations that the human designer hadn't thought of!
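To make that division of labor concrete, here is a minimal, purely illustrative Python sketch of the flow described above: a human-authored spec for a family of designs, a stand-in for the ML system that samples concrete variations, and a stand-in for the pipeline that synthesizes the experience. The DesignSpec, generate_variations, and synthesize_ase names are hypothetical and don't refer to any actual tool.

```python
# Purely illustrative sketch: a human-authored spec for a family of shoe designs,
# plus hypothetical stand-ins for the ML design generator and the ASE synthesizer.
from dataclasses import dataclass
from typing import List
import random

@dataclass
class DesignSpec:                       # the human designer's high-level parameters
    fabrics: List[str]
    sole_heights_mm: range
    colorways: List[str]
    constraints: List[str]              # e.g. "total weight < 300g"

def generate_variations(spec: DesignSpec, n: int) -> List[dict]:
    """Hypothetical stand-in for an ML design generator: samples concrete designs
    from the space the spec describes (a real system would also honor constraints)."""
    return [{
        "fabric": random.choice(spec.fabrics),
        "sole_height_mm": random.choice(list(spec.sole_heights_mm)),
        "colorway": random.choice(spec.colorways),
    } for _ in range(n)]

def synthesize_ase(design: dict) -> str:
    """Hypothetical stand-in for the pipeline that turns a proposed design into
    something a customer can watch or step into (here, just a preview file path)."""
    return f"ase_previews/{design['fabric']}_{design['colorway']}_{design['sole_height_mm']}mm.mp4"

spec = DesignSpec(fabrics=["knit", "leather"],
                  sole_heights_mm=range(20, 41, 5),
                  colorways=["black/white", "navy/orange"],
                  constraints=["total weight < 300g"])
for design in generate_variations(spec, n=3):
    print(design, "->", synthesize_ase(design))
```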
Ultimately, these technologies will power production of both higher-quality designs and more customized ones: more testing with ASEs will help designers better meet the needs of a user population, and the ability to automatically generate design variations will allow designs tailored to smaller, long-tail markets than would otherwise be feasible within time and cost budgets.
Imagine seeing a video of yourself, moving in ways that realistically depict your day-to-day activities, body motions, and posture, wearing a custom-designed outfit before it's been manufactured for you. Or picture an apparel designer using a digital design tool with a visual interface for adjusting high-level style parameters, such as fabric and shape in, say, a shoe design, and rapidly seeing the changes depicted on shoes in action on a variety of synthesized human models. Or consider an interior designer who is working with a client, analyzing the tradeoffs between various furniture styles and arrangements, and how various combinations change the feel of a room. Imagine that as the choices are specified, an immersive VR experience is automatically synthesized, letting the customer experience "living" in different variations to decide which options feel best.
In all of these examples, the big change is how rapidly new possibilities can go from a high-level specification to a sophisticated, high-fidelity experience for the customer. What was once only theoretically possible with an army of specialized talent and significant elapsed time is increasingly feasible much faster, on a DIY basis.
Training intelligent assistants to recognize what they see and hear
Another important use we see for ASEs involves depictions that aren’t directly consumed by humans at all.
Although computer vision systems can be used to power a wide range of customer experiences, they can only recognize things they've been trained to "see." Training data for these systems can be difficult to come by: there may not be enough pre-existing data, and creating the necessary audio, video, or images can be time consuming and expensive, while collecting it can also create privacy or security problems. But what if we used ASEs to train these machines? ASEs can serve as an alternate source of training data, improving the capabilities of intelligent systems and other recognizers while eliminating the large feasibility and cost hurdles involved in gathering the data in the traditional way.
This technique is already in wide use in autonomous car research, where autonomous car AIs learn to navigate simulated street scenes, with a variety of vehicles and hazards, long before the robot cars try to drive on real streets. This approach has proven highly effective at reducing the amount of work required to train the AIs (though it's still substantial), and we're applying the lessons learned to several other domains.
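As a toy illustration of that simulate-first approach (not drawn from any real autonomous-driving stack), the sketch below trains a simple tabular Q-learning agent to stay centered in a lane using a deliberately tiny, hypothetical simulator. The point is only that all of the agent's experience comes from the simulation, never from a real road.

```python
# Toy illustration of training entirely in simulation: a hypothetical one-dimensional
# "lane keeping" simulator and a tabular Q-learning agent that learns inside it.
import random

class ToyLaneSim:
    """Hypothetical simulator: state is the car's lane offset, from -2 to +2."""
    def reset(self):
        self.offset = 0
        return self.offset

    def step(self, action):                      # action: -1 steer left, 0 straight, +1 steer right
        drift = random.choice([-1, 0, 1])        # random disturbance each step
        self.offset = max(-2, min(2, self.offset + drift + action))
        reward = 1 if self.offset == 0 else -abs(self.offset)
        done = abs(self.offset) == 2             # the car has left the lane
        return self.offset, reward, done

actions = [-1, 0, 1]
q = {(s, a): 0.0 for s in range(-2, 3) for a in actions}   # Q-table over (state, action)
alpha, gamma, eps = 0.1, 0.9, 0.1
env = ToyLaneSim()

for episode in range(5000):
    s = env.reset()
    for t in range(50):                          # cap episode length
        a = random.choice(actions) if random.random() < eps \
            else max(actions, key=lambda x: q[(s, x)])
        s2, r, done = env.step(a)
        q[(s, a)] += alpha * (r + gamma * max(q[(s2, x)] for x in actions) - q[(s, a)])
        s = s2
        if done:
            break

# Learned steering policy: for each offset, which action the agent prefers.
print({s: max(actions, key=lambda a: q[(s, a)]) for s in range(-2, 3)})
```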
We envision a consumer experience that's greatly enhanced by digital assistants. Think about an assistant that can coach you in a store as you consider what product to buy, help you set up or use a product you've already bought, or help you fix it if it breaks. Imagine a system that recognizes the assemble-at-home furniture you're putting together in your apartment and can warn you when you're putting screw A into the hole intended for screw B.
Or consider the workplace experience. Imagine you want to upskill your workforce with the newest techniques for performing a type of repair job, and you want digital assistants to provide both up-front training and on-the-job coaching as workers master the new activities. Or, in a factory setting, imagine an assistant that can warn you if you're performing an operation unsafely, because it can recognize objects and activities visually and match them to the proper procedure.
A key to providing this kind of help is the assistant's ability to recognize, through vision, sound, or other sensors, what's going on: what objects are in the immediate environment, what the user is doing with them, and so forth. But a key hurdle to developing robust visual recognition capabilities through machine learning is getting access not just to a large volume of training data, but to data whose relevant and irrelevant aspects are clearly identified. For example, to test and perfect their machine learning algorithms for image recognition, academic researchers rely on a huge public image database called ImageNet. Fed ImageNet's labeled images of, say, puppies and kittens, AI models learn to distinguish puppy photos from kitten photos when evaluating a test scene. (The same is true for other modalities: a robust recognizer of sounds, including voice, or for that matter of touch sensations, smells, and so on, needs comparably labeled data.) Similarly, web-scale companies accumulate their own proprietary data sets that they can use to train their systems to recognize faces or other common objects.
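As a concrete (if greatly simplified) picture of what "learning from labeled images" looks like in code, here is a minimal PyTorch training sketch. The data/pets folder of puppy and kitten photos is a hypothetical placeholder, and this is generic supervised learning, not any particular production pipeline.

```python
# Minimal sketch of supervised image-recognition training, assuming PyTorch and
# torchvision are installed and that data/pets/ holds labeled subfolders
# (data/pets/puppy/*.jpg, data/pets/kitten/*.jpg) -- a hypothetical dataset.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("data/pets", transform=transform)   # labels come from folder names
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = models.resnet18(num_classes=2)          # small CNN; two classes: puppy vs. kitten
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)   # learn to map pixels to the right label
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last-batch loss {loss.item():.3f}")
```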
The problem is that if you want to train a recognizer to identify the new product you're giving to your workforce, there may be no images of that product in use for the AI to train with, much less the hundreds or thousands of images required to clearly distinguish it from similar products. Or say you're trying to create an intelligent assistant to help someone with their shopping. You want to know exactly which part of your product a consumer is inspecting when they pick it up in the store, so the app can give a useful answer when they ask a question like, "What is this part for?" There aren't going to be large libraries of pre-existing images or videos you can use to train your recognizers. And there certainly won't be anything that comes close to showing the items you want to recognize in the specific context where you want to recognize them, with the relevant lighting conditions and so forth, all of which helps produce the most robust possible recognizers.
If you have a lot of resources, you can have a team create the photos, audio clips, and so forth that you need to serve as training data. But that is both time consuming and very expensive. This is where ASEs come in. Instead of capturing "real" images (or sound, or activity sensor data) with a camera, you can use the special-effects techniques involved in creating deepfakes to synthesize the data you need to train your system. Using the digital models you have of the things you want recognized, you can run algorithms that synthesize the experience of seeing them, or hearing them, or, for that matter, sensing them through touch or smell, in a variety of simulated combinations and contexts, from all angles, with all kinds of lighting, acoustics, and so forth. Instead of creating a deepfake to fool a human viewer, you're generating realistic synthetic data to train robust recognizers. And you can skip gathering all the real-world training data that would otherwise be needed (and which might not even be possible to get).
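Here is a simplified sketch of what such a synthesis loop might look like in Python. The render_view function is a hypothetical stand-in for whatever actually produces each image (a game engine, Blender, a generative model); what matters is that every synthetic image arrives with its label, viewpoint, lighting, and background already known.

```python
# Sketch of generating labeled synthetic training data by "photographing" a digital
# model under randomized viewpoints, lighting, and backgrounds. render_view() is a
# hypothetical stand-in for a real rendering backend.
import csv
import os
import random

def render_view(model_path, yaw, pitch, light_intensity, background):
    """Hypothetical renderer: a real pipeline would invoke a game engine, Blender,
    or a generative model here and save the resulting image to disk."""
    image_path = f"synthetic/{background}_{yaw:.0f}_{pitch:.0f}_{light_intensity:.1f}.png"
    # ... rendering would happen here ...
    return image_path

backgrounds = ["warehouse", "retail_shelf", "kitchen_counter"]
os.makedirs("synthetic", exist_ok=True)

with open("synthetic/labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "label", "background"])
    for _ in range(10_000):                          # every image is labeled "for free"
        background = random.choice(backgrounds)
        image = render_view(
            model_path="models/new_product.obj",     # hypothetical 3D asset of the product
            yaw=random.uniform(0, 360),
            pitch=random.uniform(-30, 60),
            light_intensity=random.uniform(0.2, 1.5),
            background=background,
        )
        writer.writerow([image, "new_product", background])
```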
Real value, no fooling
For some time now, it's been clear that technology can allow us to experience things that we don't have real, physical access to. You probably already spend a fair amount of time in simulated digital fantasy worlds, whether playing high-resolution immersive video games or watching special effects-heavy movies and TV shows. But the time and cost required to produce those depictions have limited the range of applications that are practical.
ASEs make it possible to expand our experiences with products and services, from design all the way through to use, interaction and even repair — with no foolery involved.
---
Special thanks to Digital Experiences team member Dylan Snow, whose input on this document and technical work helped Mike and me go beyond these theoretical ideas to explore practical applications of ASEs.
To learn more about our work in this space, contact Alex Kass or Mike Kuniavsky, and read our latest report, “Deepfake, real value: flipping the script on deepfake technologies”.