Microsoft patent proposes rendering a 2D representation of a user as an avatar in 3D space


(XR Navigation Network, January 22, 2024) Video conferencing allows one or more participants to join a conference remotely. Hybrid conferencing may include multiple participants joining a meeting through different mediums, such as in person or remotely, and in this regard allows participants to join a meeting using a wide range of XR technologies.

Microsoft believes that the rise of telecommuting has fueled the development of videoconferencing technology. However, traditional videoconferencing currently offers only limited options for integrating remote participants into hybrid meeting environments. Additionally, certain telework technologies can be computationally intensive.

So in a patent application titled "Representing two dimensional representations as three-dimensional avatars," Microsoft proposes rendering a user's 2D representation as an avatar in 3D space, thereby improving user engagement while providing other benefits, such as reduced computation.

Specifically, the invention describes mechanisms that improve user engagement by creating a perspective view of a user from one or more 2D views received in one or more input video streams, at a relatively low computational cost compared to conventional systems that convert 2D representations into 3D avatars.

Additionally, planar objects can be rendered using a variety of techniques, such as Neural Radiance Fields (NeRF), which can combine multi-view images into a volumetric representation to generate views beyond the original ones. The volumetric representation can render a new view of an object without reference geometry by projecting output color and density into the image along rendering rays.
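As a rough illustration of the volume-rendering idea behind NeRF (a standard technique, not code from the patent), the Python sketch below accumulates color and density samples along a single rendering ray into one pixel color; all names and values are hypothetical.

```python
import numpy as np

def composite_ray(colors, densities, deltas):
    """Accumulate per-sample color/density along one ray into a pixel color.

    colors:    (N, 3) RGB values predicted at each sample point
    densities: (N,)   volume densities (sigma) at each sample point
    deltas:    (N,)   distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-densities * deltas)            # opacity of each segment
    transmittance = np.cumprod(
        np.concatenate([[1.0], 1.0 - alphas[:-1]]))       # light surviving to each sample
    weights = transmittance * alphas                      # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)        # final RGB for this ray

# Toy usage with 64 random samples along a single ray
rng = np.random.default_rng(0)
pixel = composite_ray(rng.random((64, 3)), rng.random(64) * 5.0, np.full(64, 1.0 / 64))
print(pixel)
```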


FIG. 1 shows an example system 100 that may be used to represent a 2D image as an avatar in a 3D space, such as a hybrid meeting space. As shown in FIG. 1, the system 100 comprises a computing device 102, a server 104, a video data source 106, and a communication network or networks 108. The computing device 102 may receive video streaming data 110 from the video data source 106, which may be, for example, a webcam, a video file, and the like.

The computing device 102 may include a communication system 112, a real subject identification engine or component 114, and a virtual subject generation engine or component 116.

The computing device 102 may perform at least a portion of the real subject identification component 114 to identify, locate, and/or track a subject, e.g., a person, an animal, or an object, from the video stream data 110. Alternatively, at least a portion of the virtual subject generation component 116 may be performed at the computing device 102 to determine a view, planar object segmentation, or planar object configuration of a subject in the video stream data 110.

A view of the subject may be determined based on a set of video stream data 110 and/or a plurality of video stream data sets 110. Alternatively, said planar object may be segmented based on one or more pivot points of said subject. Optionally, the planar object may be generated or otherwise configured based on a trained machine learning model.

In one embodiment, a trained classifier may be used to detect people and the skeletal pose of each person based on 2D video. The 2D video may be received from an RGBD video source, and the RGBD video may come from a range of motion-sensing input devices.

The skeleton can be used as a guide for simple geometric proxies. A general human model may be used to project pixels from one or more video sources into a new view. Alternatively, depth data obtained from the video streaming data 110 may be used to segment a human from the video, generate a better proxy geometry than a planar object, or construct a volumetric representation.
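One standard way a planar proxy lets pixels from a source camera be re-projected into a new view is a plane-induced homography. The sketch below is a minimal illustration of that idea rather than the patent's implementation; the intrinsics, pose, and plane parameters are made up for the example.

```python
import numpy as np

def plane_homography(K_src, K_dst, R, t, n, d):
    """Homography induced by a plane, mapping source-view pixels to a new view.

    K_src, K_dst: 3x3 camera intrinsics of the source and target views
    R, t:         rotation (3x3) and translation (3,) from source to target camera
    n, d:         plane normal (unit, in the source-camera frame) and its distance
    """
    H = K_dst @ (R - np.outer(t, n) / d) @ np.linalg.inv(K_src)
    return H / H[2, 2]

# Hypothetical cameras and plane: warp one source pixel into the target view
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
H = plane_homography(K, K, np.eye(3), np.array([0.1, 0.0, 0.0]),
                     n=np.array([0.0, 0.0, 1.0]), d=2.5)
uv = H @ np.array([320.0, 240.0, 1.0])
print(uv[:2] / uv[2])   # re-projected pixel coordinates in the new view
```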

The server 104 may include a communication system 112, a real subject identification engine or component 114, and a virtual subject generation engine or component 116. In one embodiment, the server 104 may perform at least a portion of the real subject identification component 114 to identify, locate, and/or track a subject, e.g., a person, an animal, or an object, from the video stream data 110.

Alternatively, the server 104 may execute at least a portion of the virtual subject generation component 116 to determine a view, planar object segmentation, or planar object configuration of a subject in the video stream data 110, such as a subject identified by the real subject identification component 114.

A view of the subject may be determined based on a set of video stream data 110 and/or a plurality of video stream data sets 110. Alternatively, said planar object may be segmented based on one or more pivot points of said subject. Optionally, the planar object may be generated or otherwise configured based on a trained machine learning model.

In one embodiment, the computing device 102 may communicate data received from the video data source 106 to the server 104 via the communication network 108, which may perform at least a portion of the real subject identification component 114 and/or the virtual subject generation component 116.

In one embodiment, the video data source 106 may be located locally to the computing device 102. For example, the video data source 106 may be a camera coupled to the computing device 102. Alternatively, the video data source 106 may be remote from the computing device 102 and may communicate the video streaming data 110 to the computing device 102 over a communication network.


The hybrid conference 200 depicted in FIG. 2 may be held in a room 204. In one example, the room 204 may be a physical room. Alternatively, the room 204 may be a virtual room. A plurality of participants 208 may participate in the hybrid meeting 200.

A first subset 208A of the plurality of participants 208 may participate in the hybrid meeting 200 via videoconferencing. A second subset 208B of the plurality of participants 208 may be physically present at the hybrid meeting 200. A third subset 208C of the plurality of participants 208 may wear a head-mounted display 216 to view one or more of the generated avatars 210 that correspond to the remote participants.

Each participant in the first subset 208A of the plurality of participants 208 may be located away from the room 204. The participants in the first subset 208A may each have a computing device and/or a video data source that is local to them and allows them to participate in the hybrid conference 200.

Each participant in the second subset 208B of the plurality of participants 208 may be physically located within the room 204. The third subset 208C may be physically located in the same room as the second subset 208B or in a remote location. The room 204 may include one or more cameras 220 disposed therein to generate a video stream of one or more participants from the second subset 208B for one or more remote users.

The generated avatars 210 corresponding to one or more participants in the third subset 208C may be visible to at least some of the participants 208 away from the hybrid meeting 200. Alternatively, at least some of the plurality of participants 208 may view the one or more generated avatars 210 by wearing a head-mounted display 216.

In one embodiment, the headset 216 may be a virtual reality headset, or the headset 216 may be an augmented reality headset.

With respect to FIG. 1, the one or more cameras 220 may be analogous to the video data source 106 described above. The one or more cameras 220 may collect a batch of still images taken at successive time intervals. Alternatively, the one or more cameras 220 may collect a single still image taken at a particular moment in time.

Optionally, the one or more cameras 220 may collect a live feed of video data. The one or more cameras 220 may be placed at specific locations within the room 204 based on user preference. Alternatively, any number of cameras 220 may be present within the room 204, depending on various tradeoffs.


FIG. 3 illustrates one or more views of a generated subject 302. The subject 302 may be a person, animal, and/or object generated based on video streaming data. Alternatively, the subject 302 may be generated based on one or more still images and/or one or more animations corresponding to a physical version of the subject 302. The subject 302 shown in embodiment 300 is a virtual subject, which may be similar to the generated avatar 210 of FIG. 2.

The subject 302 may be located within the hybrid meeting space. Alternatively, the subject 302 may correspond to a physical subject located away from the hybrid meeting space. The subject 302 may be segmented into a plurality of planar objects 304. The plurality of planar objects 304 may be a plurality of billboards, and the plurality of billboards 304 may be planar surfaces that translate relative to one another. The planes may be 2D planes and/or surfaces. Alternatively, the planes may be rotated at an angle with respect to each other, i.e., not coplanar.

The subject 302 may include one or more pivot points about which two or more portions of the subject 302 are configured to rotate. For example, if the subject 302 is a person, the one or more pivot points may be joints about which body parts rotate. The one or more billboards may be split at the pivot points and connected by virtual joints, which may be the pivot points.
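To make the billboard-and-pivot description concrete, the following sketch shows one possible (entirely hypothetical) data layout: each billboard carries a crop of the source frame plus a 3D pose, and virtual pivot joints link neighboring billboards, mirroring the body joints described above.

```python
from dataclasses import dataclass

@dataclass
class Billboard:
    """A textured planar segment covering one part of the subject."""
    name: str
    texture_region: tuple                      # (x, y, w, h) crop in the source frame
    position: tuple = (0.0, 0.0, 0.0)          # 3D placement of the plane's pivot
    rotation: tuple = (0.0, 0.0, 0.0)          # orientation relative to the parent plane

@dataclass
class PivotJoint:
    """Virtual joint connecting two billboards, mirroring a physical body joint."""
    parent: Billboard
    child: Billboard
    location: tuple                            # 3D point both planes rotate about

# Hypothetical split of an arm into two billboards at the elbow pivot
upper_arm = Billboard("upper_arm", (100, 80, 40, 120))
forearm = Billboard("forearm", (100, 200, 40, 110))
elbow = PivotJoint(upper_arm, forearm, location=(0.2, 1.1, 0.0))
print(elbow)
```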

The one or more billboards 304 may be bent over one another, stretched to overlap or intersect one another, blurred over one another, blended with one another, discretely adjacent to one another, or otherwise visually manipulated to form a virtual subject corresponding to a physical subject.

Each of the one or more billboards 304 may be translated and/or rotated in 3D space to form a perspective view of the portion of the subject 302 contained in the translated and/or rotated billboard 304.
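The rotation that turns a billboard toward a viewer is ordinary "billboarding" math. The sketch below is a minimal illustration rather than the patent's method: it builds a rotation matrix whose forward axis points from the plane's center to an assumed camera position.

```python
import numpy as np

def billboard_rotation(plane_center, camera_pos, up=np.array([0.0, 1.0, 0.0])):
    """Rotation matrix that turns a plane's +Z normal toward the camera."""
    forward = camera_pos - plane_center
    forward = forward / np.linalg.norm(forward)
    right = np.cross(up, forward)
    right = right / np.linalg.norm(right)
    true_up = np.cross(forward, right)
    return np.column_stack([right, true_up, forward])   # columns are the rotated axes

# Hypothetical placement: a billboard at chest height facing a camera 2 m away
R = billboard_rotation(np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.6, 2.0]))
print(R @ np.array([0.0, 0.0, 1.0]))   # the plane's normal now points at the camera
```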

In embodiment 300, one or more of the billboards 304 corresponding to the body of the subject 302 may be intentionally rotated such that the geometry of the subject 302 is visible to a viewer.

Determining which aspects of the subject 302 are preferentially oriented toward the observer may be configurable by the user. Alternatively, which aspects of the subject 302 should face the observer may be learned by a machine learning model. For example, participants in a hybrid meeting may engage more frequently with a subject whose billboards are arranged in a particular manner.

Such examples may include an observer-facing face, a user-facing palm, a user-facing chest, and the like.

In one embodiment, a skeleton of the subject 302 may be generated or detected based on the video stream data. The skeleton may be rendered to a new view in which each limb of the subject 302 is aligned along its respective axis and faces a reference point corresponding to the new view. The reference point may be a camera of a computing device. If the skeleton maintains the same alignment as in the original video, a single billboard may be generated for the entire body of the subject 302.

Alternatively, if the hand of the subject 302 touches a whiteboard, a separate billboard can be generated for each bone of the hand to satisfy the constraints of the new view.

The billboard 304 may be used to generate additional environmental effects that enhance the realism of the blended scene, such as rendering reflections of the subject 302 onto the conference table, rendering shadows of the subject 302's hands and body onto the table and/or walls, or even casting shadows onto a physical conference table in a physical room to better represent remote participants.

Optionally, a machine learning model may be used to segment the subject 302 into the plurality of billboards 304. The machine learning model may be trained to generate the segmented billboards 304 based on a plurality of constraints or factors.

For example, the plurality of constraints may include the computational cost of generating the segmented billboards 304 and an error between the generated avatar formed by the segmented billboards 304 and the corresponding physical subject. The physical subject corresponding to the generated avatar may be found in the one or more input video streams.

Furthermore, generating a relatively large number of billboards may increase computational overhead, even though it may reduce the error between the subject 302 and the corresponding physical subject. Thus, training of the machine learning model may further rely on factors that weight these variables based on user preferences and/or economic considerations.
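One way to picture the weighting alluded to above is a simple scalar objective that trades billboard count against reconstruction error. The weights and candidate values below are entirely hypothetical stand-ins for the user preferences and economic factors the patent mentions.

```python
def segmentation_objective(num_billboards, reconstruction_error,
                           compute_weight=0.05, error_weight=1.0):
    """Toy objective: lower is better; more billboards cost compute but cut error."""
    return compute_weight * num_billboards + error_weight * reconstruction_error

# Hypothetical candidates mapping billboard count -> reconstruction error
candidates = {1: 0.42, 5: 0.18, 12: 0.09, 30: 0.07}
best = min(candidates, key=lambda n: segmentation_objective(n, candidates[n]))
print(best)   # -> 5 with these made-up weights
```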

It should be recognized that the billboard split itself may not be seamless: whenever a billboard is split into multiple billboards, distracting artifacts may result. Therefore, it is beneficial to use as few billboards as possible, splitting only when mapping the subject through a single billboard falls short of a specific desired constraint.

The plurality of billboards 304 may be output in an output video stream. For example, the plurality of billboards 304 may form the subject 302, and the subject 302 may be output in the output video stream. The output video stream may be output to a computing device. Thus, after the subject 302 is generated, a user may interact with the subject 302 via the plurality of billboards 304 in the output video stream.


FIG. 4 illustrates an example method 400. Method 400 may be a method of representing a 2D image of a user as an avatar in 3D space.

Starting at 402, one or more input video streams are received. For example, the input video streams may be received from a video data source. The input video streams may include 3D scenes. The system may recognize 3D sub-scenes of the input video streams and divide or segment subjects within the 3D sub-scenes.

At 404, it is determined whether the input video stream contains a subject. The visual data may be processed using the mechanisms described in the invention to recognize the presence of one or more individuals, one or more animals, and/or one or more objects of interest.

Also at 404, one or more subjects may be identified using specific software. For example, a person may be identified by logging into a particular application.

If it is determined that the input video stream does not contain a subject, the method proceeds to operation 406, where a default action may be performed. For example, the input video stream may have an associated preconfigured action. In other examples, method 400 may determine whether the input video stream has an associated default action, such that no action is performed as a result of the received input video stream. Method 400 may terminate at operation 406. Alternatively, method 400 may return to operation 402 to provide a continuous video-stream feedback loop.

However, if it is determined that the input video stream contains a subject, the method proceeds to operation 408, in which a first subject is identified in one or more of the input video streams. The first subject may be a person, an animal, or an object. Alternatively, where there is a plurality of input video streams, the first subject may be identified in each of the plurality of input video streams, for example by visual identification of common features in each of the input video streams corresponding to the first subject.

The process advances to operation 410, wherein a first view of the first subject is identified based on the one or more input video streams. The first view may be a front view, a side view, a rear view, a top view, a perspective view, or any type of view. The first view may be recognized relative to a reference point or shape within a room in which the first subject is located. The first view may be identified based on the geometry of the room. Alternatively, the first view may be identified based on a relative rotation of the first subject in one or more input video streams.

The process advances to operation 412, wherein a second view of the first subject is identified based on the one or more input video streams. The second view may be a front view, a side view, a rear view, a top view, a perspective view, or any type of view. The second view may be recognized relative to a reference point or shape within the room in which the first subject is located.

At 414, it is determined whether the second view of operation 412 is substantially different from the first view of operation 410. For example, if the second view is substantially the same as the first view, it may be computationally advantageous to output one of the first view or the second view from one or more of the input video streams, as no further processing needs to be performed based on the first view and the second view.

Alternatively, determining whether the second view is substantially different from the first view may include calculating an error between the visual components of the first view and the second view. The user may configure a predetermined threshold at which the second view is determined to be substantially different from the first view.
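The patent does not specify the error metric or threshold, but the decision at 414 could look something like the sketch below, which uses mean absolute pixel error as an assumed measure of how different the two views are.

```python
import numpy as np

def views_substantially_differ(view_a, view_b, threshold=0.05):
    """Decide whether two view images differ enough to warrant re-rendering.

    view_a, view_b: HxWx3 arrays with values in [0, 1]
    threshold:      user-configurable cutoff on mean absolute pixel error (assumed metric)
    """
    error = np.mean(np.abs(view_a.astype(np.float32) - view_b.astype(np.float32)))
    return error > threshold

# Toy usage: a second view that is only slightly noisier than the first
rng = np.random.default_rng(1)
a = rng.random((240, 320, 3))
b = np.clip(a + rng.normal(0.0, 0.02, a.shape), 0.0, 1.0)
print(views_substantially_differ(a, b))   # False -> pass one view through (operation 406)
```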

If it is determined that the second view is not substantially different from the first view, the method proceeds to operation 406, where the default action is performed. For example, one of the first view or the second view from the one or more input video streams may be output without further processing to reduce computational effort.

In other examples, method 400 may include determining whether the first view or the second view has an associated default action, such that no action is performed as a result of identifying the first view and/or the second view of the first subject. Method 400 may terminate at operation 406. Alternatively, method 400 may return to operation 402 to provide a continuous video-stream feedback loop.

However, if it is determined that the second view is substantially different from the first view, the method proceeds to operation 416, in which the first subject is segmented into a plurality of planar objects based on the first and second views of the first subject. The plurality of planar objects may be a plurality of billboards. As described earlier with respect to FIG. 3, the plurality of billboards may be planes that translate relative to one another.

By merging the multiple views, a 3D avatar of the first subject can be generated in its folded configuration, providing one or more observers with a perspective view of the first subject. Accordingly, each billboard of the plurality of billboards may face the camera so that a user can see the content of each billboard.

The first subject may include one or more pivot points about which two or more portions of the first subject are configured to rotate. For example, if the first subject is a person or an animal, the one or more pivot points may be joints about which body parts rotate. The one or more planar objects may be split at the pivot points. In addition, the one or more planar objects may be connected by virtual joints. The virtual joints may be rotational joint points corresponding to the physical joints of the first subject.

The one or more planar objects may be bent over one another, stretched to overlap or intersect one another, blurred over one another, blended with one another, discretely adjacent to one another, or otherwise visually manipulated to form a subject corresponding to the physical subject.

Additionally or alternatively, a machine learning model may be used to segment the first subject into the plurality of planar objects. The machine learning model may be trained to generate the segmented planar objects based on a plurality of variables.

At operation 418, the plurality of planar objects may be output. The plurality of planar objects may be output in an output video stream. For example, the plurality of planar objects may form the generated 3D avatar. The output video stream may be output to a computing device, such as the head-mounted display 216.

Microsoft says this approach can improve on existing 3D avatar generation methods at a relatively low computational cost while still providing perspective information to viewers, thereby improving the user experience.

The Microsoft patent application titled "Representing two dimensional representations as three-dimensional avatars" was originally filed in June 2022 and was recently published by the United States Patent and Trademark Office.

Generally speaking, a U.S. patent application is automatically published 18 months from its filing or priority date, or earlier at the applicant's request. Note that publication of a patent application does not mean that the patent has been granted. After a patent application is filed, the USPTO still needs to conduct substantive examination, which can take anywhere from one to three years.

In addition, this is only a patent application, which does not necessarily mean it will be granted, and it remains uncertain whether the technology will actually be commercialized or what the practical results would be.
