Meta and Stanford University use generative AI to realize human-object interaction in 3D environments, opening the door to highly dynamic interactions for AR/VR
(XR Navigation Network, December 12, 2023) Stanford University and Meta's FAIR team recently introduced a breakthrough artificial intelligence system that can generate natural, synchronized motion between virtual humans and objects from text descriptions alone.
The new system, called CHOIS (Controllable Human-Object Interaction Synthesis), uses conditional diffusion model techniques to produce seamless, precise interactions such as "lift the table above your head, walk, and put the table down."
Looking ahead, virtual characters built on such systems could understand and respond to language commands as fluidly as humans, with continuous human-object interactions generated directly from language descriptions.
The team notes that synthesizing human behavior in 3D environments is critical for applications such as computer graphics, embodied artificial intelligence, and robotics. While humans navigate their environments and perform tasks effortlessly, this remains a daunting challenge for robots and virtual humans, because every task requires precise coordination between people, objects, and their surroundings.
Language, for its part, is a powerful tool for expressing intent. Synthesizing realistic human and object motion, guided by language and scene context, is a cornerstone of building advanced artificial intelligence systems.
The Stanford and FAIR teams argue that although prior studies have explored human-scene interaction, they are limited to scenes with static objects and ignore the highly dynamic interactions that occur frequently in daily life. The field has recently made progress in modeling dynamic human-object interaction, but related methods focus only on smaller objects or lack the ability to manipulate a variety of objects. Even approaches that manipulate larger objects of various kinds rely on past interaction states or on the complete object motion sequence, and cannot synthesize object and human motion from the initial state alone.
In the CHOIS research, the team therefore focused on synthesizing realistic interactions with a variety of larger objects from language and an initial state.
Generating continuous human-object interactions from language descriptions poses several challenges. First, the system must generate realistic, synchronized object and human motion: during an interaction, the human hands should maintain appropriate contact with the object, and the object's movement should remain causally consistent with the human's actions.
Second, 3D scenes often contain many objects, which constrains the space of feasible motion trajectories. Interaction synthesis must therefore adapt to cluttered environments rather than assume an empty scene.
For CHOIS, the team focused on the key problem of synthesizing human-object interaction in a 3D environment from natural language commands, generating object and human motion under the guidance of language and sparse object waypoints.
The motion should follow the instructions given in the language input while respecting environmental constraints defined by waypoint conditions derived from the 3D scene geometry. To achieve this, the researchers employ a conditional diffusion model that simultaneously generates synchronized object and human motion, conditioned on the language description, the initial states, and sparse object waypoints.
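To make the conditioning concrete, the sketch below runs a standard DDPM-style reverse sampling loop in which the denoiser receives, at every step, a conditioning vector packing the language embedding, the initial human/object state, and the sparse waypoints. All names here (`ConditionalDenoiser`, the MLP backbone, the noise schedule) are illustrative assumptions, not the authors' implementation.

```python
import torch

# Illustrative denoiser: a small MLP stand-in for the model's actual network.
class ConditionalDenoiser(torch.nn.Module):
    def __init__(self, motion_dim, cond_dim, hidden=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(motion_dim + cond_dim + 1, hidden),
            torch.nn.SiLU(),
            torch.nn.Linear(hidden, motion_dim),
        )

    def forward(self, x_t, t, cond):
        # Predict the noise in x_t at diffusion step t, given the conditioning.
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([x_t, t_feat, cond], dim=-1))

@torch.no_grad()
def sample_interaction(model, cond, motion_dim, steps=1000):
    """DDPM reverse process: denoise pure noise into a joint human+object
    motion vector, conditioned on language, initial state, and waypoints."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(cond.shape[0], motion_dim)  # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((cond.shape[0],), t)
        eps = model(x, t_batch, cond)
        # Posterior mean of x_{t-1} under the standard DDPM parameterization.
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```

In such a setup the conditioning vector would be assembled along the lines of `cond = torch.cat([lang_emb, init_state, waypoints.flatten(1)], dim=-1)`, so the same network sees all three signals at once.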
To improve the accuracy of predicted object motion, an object geometry loss is added during training. The researchers also designed guidance terms, applied during the sampling process, that improve the realism of the generated interactions.
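The article does not give the exact form of this loss. A common way to realize an object geometry loss of the kind described, sketched below under assumed tensor shapes, is to transform points sampled on the object mesh by the predicted and ground-truth rigid motions and penalize the per-point discrepancy, rather than comparing raw rotation and translation parameters.

```python
import torch

def object_geometry_loss(pred_rot, pred_trans, gt_rot, gt_trans, obj_points):
    """Per-point error on the object surface under predicted vs. true motion.

    pred_rot, gt_rot:     (T, 3, 3) rotation matrices per frame
    pred_trans, gt_trans: (T, 3)    translations per frame
    obj_points:           (N, 3)    points sampled on the rest-pose mesh
    """
    # Transform the sampled points by predicted and ground-truth motion.
    pred_pts = torch.einsum('tij,nj->tni', pred_rot, obj_points) + pred_trans[:, None]
    gt_pts = torch.einsum('tij,nj->tni', gt_rot, obj_points) + gt_trans[:, None]
    # Average Euclidean error over all frames and surface points.
    return (pred_pts - gt_pts).norm(dim=-1).mean()
```

A loss of this shape couples rotation and translation errors through their effect on the actual geometry, which is typically what makes predicted object trajectories look right.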
Experiments demonstrate the effectiveness of the learned interaction synthesis module, which produces continuous, realistic, and context-aware interactions given a language description and a 3D scene.
Through the conditional diffusion model, CHOIS simulates and generates detailed motion sequences: given the initial positions of the human and the object, plus a verbal description of the desired task, it generates the full sequence of motions.
For example, if the command is to move a lamp closer to the couch, CHOIS will understand the instruction and create a realistic animation of the avatar picking up the lamp and placing it near the couch.
What makes CHOIS particularly distinctive is its use of sparse object waypoints together with the language description to guide animation. The waypoints mark key points along the object's trajectory, ensuring that the movement is not only physically plausible but also consistent with the goal expressed in the language input.
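One simple way to feed such sparse waypoints to a conditional model, an assumption on our part since the article does not describe the exact encoding, is to scatter them into a dense per-frame tensor with a validity flag so the network knows which frames carry a constraint:

```python
import torch

def build_waypoint_condition(waypoints, frame_ids, num_frames):
    """Pack sparse object waypoints into a dense per-frame condition.

    waypoints: (K, 3) object positions at a few key frames
    frame_ids: (K,)   indices of those frames
    Returns a (num_frames, 4) tensor: xyz plus a validity flag, zeros elsewhere.
    """
    cond = torch.zeros(num_frames, 4)
    cond[frame_ids, :3] = waypoints
    cond[frame_ids, 3] = 1.0  # mark frames that carry a waypoint
    return cond

# Example: a lamp routed across the room via three waypoints over 120 frames.
wps = torch.tensor([[0.0, 0.0, 0.8], [1.0, 0.5, 0.9], [2.0, 0.2, 0.5]])
cond = build_waypoint_condition(wps, torch.tensor([0, 60, 119]), 120)
```

Because most frames are unconstrained, the model retains freedom in how the object moves between waypoints while still being pinned to the scene-derived route.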
CHOIS is also unusual in combining language understanding with physically plausible motion. Traditional models often struggle to connect language to spatial and physical actions, especially over longer interactions, where many factors must be balanced to maintain realism.
CHOIS bridges this gap by interpreting the intent and style behind a verbal description and translating it into a sequence of physical movements that respect the constraints of the human body and the objects involved. The system ensures that contact points (such as a hand touching an object) are accurately represented and that the object's motion is consistent with the forces exerted by the virtual human.
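Mechanically, a contact constraint like this can be enforced at sampling time with a classifier-guidance-style gradient that nudges the denoised motion toward hand-object contact on the frames where contact is expected. The sketch below shows only the mechanism, not the actual guidance terms in CHOIS, which the article does not detail; `hand_pos_fn` and `obj_pos_fn` are assumed differentiable decoders from the motion representation.

```python
import torch

def contact_guidance(x, hand_pos_fn, obj_pos_fn, contact_mask, weight=1.0):
    """Gradient of a hand-object distance penalty w.r.t. the motion sample x.

    hand_pos_fn(x): (T, 3) hand joint positions decoded from x
    obj_pos_fn(x):  (T, 3) closest object surface points decoded from x
    contact_mask:   (T,) boolean, True on frames where contact should hold
    """
    x = x.detach().requires_grad_(True)
    dist = (hand_pos_fn(x) - obj_pos_fn(x)).norm(dim=-1)   # (T,)
    penalty = (dist * contact_mask.float()).mean()
    (grad,) = torch.autograd.grad(penalty, x)
    return -weight * grad  # step direction that reduces the contact penalty
```

At each denoising step, a correction of this form would be added to the model's current estimate, pulling the hands onto the object during grasp frames without retraining the model.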
The CHOIS system could have a profound impact on a range of fields, particularly animation and virtual reality. If artificial intelligence can interpret natural language instructions and generate realistic human-object interactions, CHOIS could greatly reduce the time and effort needed to animate complex scenes. In virtual reality environments, CHOIS could enable far more immersive, interactive experiences, since users could direct virtual characters with natural language and watch them carry out tasks with lifelike precision.
This level of interaction could transform the VR experience from rigidly scripted events into dynamic environments that respond realistically to user input.
The team believes the study is an important step toward advanced artificial intelligence systems that can simulate continuous human behavior in diverse 3D environments. It also opens the door to further research on synthesizing human-object interaction from 3D scenes and language input, and may lead to more sophisticated artificial intelligence systems in the future.