Microsoft AR/VR patent describes video streaming based on eye gaze position
(XR Navigation Network, December 22, 2023) Foveated rendering relies on the fact that the human visual system perceives sharp detail only within a 5-10 degree region centered on the fovea; beyond 10 degrees, detail perception rapidly drops to 20% or less. By using foveated rendering, an XR headset can greatly reduce its rendering workload.
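To make the falloff concrete, here is a minimal sketch (not from the patent) of how a renderer might map angular distance from the gaze center to a relative quality factor, using the 5-10 degree foveal region and the roughly 20% peripheral figure cited above:

```python
# Map eccentricity (angular distance from the gaze center, in degrees)
# to a relative rendering quality factor in [0.2, 1.0].
def quality_factor(eccentricity_deg: float) -> float:
    if eccentricity_deg <= 5.0:
        return 1.0  # full detail inside the foveal region
    if eccentricity_deg >= 10.0:
        return 0.2  # peripheral vision: roughly 20% detail or less
    # Linear falloff between 5 and 10 degrees.
    return 1.0 - 0.8 * (eccentricity_deg - 5.0) / 5.0
```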
In fact, vendors such as Microsoft are actively exploring applications of this principle beyond rendering, such as video streaming. In a patent application titled "Gaze based video stream processing," the company describes video streaming driven by eye gaze position.
In one embodiment, the computing system may utilize a gaze estimation system to estimate the user's gaze position, allowing the processor to reduce the quality of any video stream the user is not actively viewing.
A video stream may be processed differently depending on whether it is to be displayed within the area of the user's gaze point. The gaze position may be estimated, detected, or otherwise determined by a computing system using an image sensor.
That is, the image sensor may acquire one or more images, and the gaze position may be estimated from those images. The estimated gaze position may then be used to process the video streams to be displayed to the user, for example by decreasing the image quality of video streams displayed outside the estimated gaze position and increasing the image quality of video streams displayed within it.
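As a rough illustration of this two-sided policy on the sender side, the sketch below tags each stream for a quality increase or decrease depending on whether it falls within the gaze area. The positions, radius, and action labels are assumptions for illustration; the patent publishes no code.

```python
# Hypothetical sketch: plan per-stream quality adjustments from the
# estimated gaze position. `streams` maps ids to on-screen tile positions.
def plan_quality(streams, gaze_x, gaze_y, radius=250.0):
    plan = {}
    for stream_id, (x, y) in streams.items():
        inside = (x - gaze_x) ** 2 + (y - gaze_y) ** 2 <= radius ** 2
        plan[stream_id] = "increase" if inside else "decrease"
    return plan


# With gaze at (400, 300), only the nearby stream keeps full quality:
print(plan_quality({1: (390, 310), 2: (1500, 200)}, 400, 300))
# {1: 'increase', 2: 'decrease'}
```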
FIG. 1 depicts a video processing system 100 configured to process a video stream based on an estimated gaze position of a user. The video processing system 100 includes a computing device 110 and a display device 120.
Computing device 110 is typically configured to receive a plurality of video streams and provide a representation of the video streams to display device 120 for display to user 102. Examples of computing device 110 include a web server, a cloud server, or another suitable computing device.
The computing device 110 may include a stream processor 112 that processes video streams for transmission to the display device 120. In various embodiments, the stream processor 112 is configured to reduce a transmission bit rate of at least one video stream before transmission to the display device 120. Computing device 110 may also include a gaze detector 114 configured to identify an estimated gaze position of user 102. Gaze detector 114 is configured to utilize a neural network model, such as neural network model 162.
The display device 120 includes an image sensor 132 that has a field of view and can acquire one or more images of the user 102; the display device 120 and/or the computing device 110 use those images to identify an estimated gaze position of the user 102.
The image sensor 132 may acquire one or more images of the user 102 while the user 102 is located within its field of view. The acquired images may then be provided to a neural network model executing on a neural processing unit.
The neural network model may determine gaze information about the user 102, such as an estimated gaze position, and provide it to the stream processor 112. Because the neural processing unit is specifically designed and/or programmed for neural network workloads, it consumes fewer resources than a central processing unit would for the same task.
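A minimal sketch of that hand-off, assuming a hypothetical gaze_model object whose infer method stands in for whatever NPU runtime the device actually uses:

```python
# Hypothetical sketch: camera frames go to a gaze model on the neural
# processing unit; the stream processor receives an (x, y) estimate.
def estimate_gaze(frames, gaze_model):
    """Run the gaze model on one or more user images and return (x, y)."""
    xs, ys = [], []
    for frame in frames:
        x, y = gaze_model.infer(frame)  # assumed NPU-backed call
        xs.append(x)
        ys.append(y)
    # Average across frames to smooth per-frame estimation noise.
    return sum(xs) / len(xs), sum(ys) / len(ys)
```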
The gaze information determined and provided by the neural network model may include an estimated gaze position of the user 102, which may correspond to a position on the display 130 and/or its surroundings, expressed for example as X, Y, Z coordinates.
The display processor 144 is configured to perform one or more image enhancement algorithms on the one or more video streams, such as a super-resolution algorithm that increases the spatial resolution or frame rate of a video stream, a sparse reconstruction algorithm, a foveated decoding algorithm, or another suitable image enhancement algorithm.
In other words, the display processor 144 is configured to process a first video stream having relatively low image quality into a second video stream having relatively high image quality. The display processor 144 may use the estimated gaze position to select a subset of the received video streams for the image enhancement algorithm, for example only those video streams within the estimated gaze position.
In this manner, the computing device 140 provides a high-quality video stream where the user is looking, and a low-quality video stream where the user cannot readily discern additional detail, saving processor cycles for other activities.
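A hedged sketch of that receiver-side trade-off; the DisplayStream type and the enhance callback are placeholders for the patent's unspecified data structures and enhancement algorithms:

```python
# Run the costly enhancement only on streams drawn inside the gaze area;
# everything else stays at its transmitted (low) quality, saving cycles.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DisplayStream:
    x: float      # where the stream's tile is drawn on the display
    y: float
    frame: Any    # current decoded frame


def enhance_gazed(streams, gaze_x, gaze_y, radius, enhance: Callable):
    for s in streams:
        if (s.x - gaze_x) ** 2 + (s.y - gaze_y) ** 2 <= radius ** 2:
            s.frame = enhance(s.frame)  # high quality where the user looks
    return streams
```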
The neural network model 162 is configured to estimate the user's gaze position from one or more images of the user. The neural network model 162 may be trained on the source images 164 to estimate gaze position and to recognize regions of interest. The video data 166 may include recorded video, a video stream, or data that can be used to generate a video stream.
FIG. 2 depicts an example of a video processing system 200. The video processing system 200 includes a computing device 210, a first display device 220 for a first user, a second display device 230 for a second user, and a third display device 240 for a third user. The computing device 210 generally corresponds to the computing device 110 of FIG. 1 and includes a stream processor 212. The computing device 210 may also include a gaze detector 214.
The first display device 220 generally corresponds to the display device 120 of FIG. 1 and includes a gaze detector 222 and a display processor 224. The first display device 220 captures a first video stream 226 using an image sensor such as the image sensor 132. Additionally, the first display device 220 identifies an estimated gaze position 228 of the first user. The first display device 220 transmits the first video stream 226 and the estimated gaze position 228 to the computing device 210.
The second display device 230 is configured to display the video stream received from the computing device 210 and to capture a second video stream 236 using a suitable image sensor. For example, the image sensor may be similar to the image sensor 132.
The third display device 240 generally corresponds to the display device 120 of FIG. 1 and includes a gaze detector 244. The third display device 240 captures a third video stream 246 using an image sensor, such as the image sensor 132. Additionally, the third display device 240 identifies an estimated gaze position 248 of the third user and transmits the third video stream 246 and the estimated gaze position 248 to the computing device 210.
The stream processor 212 is configured to reduce the transmission bit rate of at least one of the second video stream 236 and the third video stream 246 prior to transmission to the first display device 220. By lowering the transmission bit rate, computing device 210 reduces the bandwidth required to transmit the composite video stream 250 to the first display device 220. The lower transmission bit rate also allows the first display device 220 to display the composite video stream with lower power consumption or at a faster display frame rate.
In various embodiments, the stream processor 212 reduces the transmission bit rate of a video stream by reducing the pixel count, reducing the frame rate, changing the color palette or color space, changing the video encoding format, reducing the audio quality, or any combination thereof. As an example, the stream processor 212 may reduce the pixel count or resolution of a video stream by resampling it from 1920 x 1080 pixels to 1280 x 720 pixels or by cropping it to a smaller size.
As another example, the stream processor 212 reduces the frame rate from 60 frames per second to 30 frames per second or 24 frames per second. In another example, the stream processor 212 changes the video encoding format to a more efficient encoding format, such as from H.262 format to H.264 or H.265 format.
In one embodiment, the stream processor 212 performs the above processing by decoding the video stream to obtain decoded data, and then re-encoding the decoded data in a different video coding format, or with different parameters of the same format, to reduce the transmission bit rate. In other embodiments, the stream processor 212 transcodes the video stream directly into a different video coding format.
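As a concrete illustration, the reductions described above map directly onto standard ffmpeg options. This is a simplified sketch that assumes ffmpeg is installed; it is not code from the patent.

```python
# Re-encode a stream at a lower bit rate: 1080p -> 720p, 60 fps -> 30 fps,
# and a more efficient codec (H.265). All flags are standard ffmpeg options.
import subprocess


def reduce_bitrate(src: str, dst: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-vf", "scale=1280:720",  # reduce pixel count
            "-r", "30",               # reduce frame rate
            "-c:v", "libx265",        # more efficient coding format
            dst,
        ],
        check=True,
    )
```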
FIG. 5 depicts a method 500 for processing a video stream.
Beginning at step 502, a plurality of video streams are received for transmission to a display device, each having a respective initial image quality level. In one embodiment, the plurality of video streams corresponds to the video streams 226, 236, and 246, and the display device corresponds to the first display device 220. In another embodiment, the plurality of video streams corresponds to video streams 360, 370, and 380, and the display device corresponds to the display device 130.
At step 504, an estimated gaze position of the user of the display device is identified. In one embodiment, the estimated gaze position is received from the display device, e.g., from the gaze detector 142 or 222. In other embodiments, one or more images are received from the display device, a plurality of features are extracted from the images and provided to a neural network, and the neural network identifies the estimated gaze position as the location toward which the user's gaze is directed.
At step 506, at least one of the plurality of video streams is processed to have a modified image quality level based on the estimated gaze position. The modified image quality level is lower than the corresponding initial image quality level, with at least one of a reduced number of pixels, a reduced frame rate, and an increased compression.
In some embodiments of step 506, the modified image quality level is selected from a plurality of quality levels based on the distance between the position at which the at least one video stream is displayed by the display device and the estimated gaze position.
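A minimal sketch of this distance-based selection; the thresholds and tier names below are illustrative assumptions, not values from the patent.

```python
# Illustrative quality tiers keyed by distance (in pixels) from the
# estimated gaze position. The bands and names are assumptions.
QUALITY_TIERS = [
    (100.0, "full"),        # at or near the gaze point
    (300.0, "medium"),      # near periphery
    (float("inf"), "low"),  # far periphery
]


def select_quality(stream_x, stream_y, gaze_x, gaze_y):
    """Pick a quality level for a stream displayed at (stream_x, stream_y)."""
    dist = ((stream_x - gaze_x) ** 2 + (stream_y - gaze_y) ** 2) ** 0.5
    for threshold, tier in QUALITY_TIERS:
        if dist <= threshold:
            return tier
```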
At step 508, the plurality of video streams are transmitted to the display device. In one embodiment, a composite video stream is generated and transmitted to the display device, comprising the at least one processed video stream having the modified image quality level and the remaining video streams of the plurality.
FIG. 6 depicts a method 600 for processing a video stream.
Beginning at step 602, a plurality of video streams are received for display by a display device, each having a respective initial image quality level. The initial image quality levels are relatively low, e.g., to reduce the transmission bit rate of the video streams.
At step 604, an estimated gaze position of the user of the display device is identified. In one embodiment, the estimated gaze position is determined by a gaze detector of the display device, e.g., by gaze detector 142 or 222, and provided to the display processor 144.
At step 606, at least one of the plurality of video streams is processed to have a modified image quality level based on the estimated gaze position. The modified image quality level is higher than the corresponding initial image quality level, with at least one of an increased number of pixels, an increased frame rate, and a reduced compression.
In one embodiment, the display processor 144 performs one or more image enhancement algorithms on the at least one video stream, such as a super-resolution algorithm that increases the spatial resolution or frame rate of the video stream, a sparse reconstruction algorithm, a foveated decoding algorithm, or another suitable image enhancement algorithm.
In one embodiment, the display processor 144 selects video streams within the estimated gaze position and performs an image enhancement algorithm on only the selected video streams.
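For illustration only, bicubic upscaling with OpenCV can stand in for the enhancement step; the patent does not specify the actual super-resolution algorithm, and the opencv-python dependency is an assumption of this sketch.

```python
# Illustrative stand-in for step 606's enhancement: upscale only the
# frames of streams inside the gaze area. cv2.resize with bicubic
# interpolation substitutes for a real super-resolution model.
import cv2


def enhance_if_gazed(frame, in_gaze_area: bool):
    """Upscale a 720p frame to 1080p only when the user is looking at it."""
    if not in_gaze_area:
        return frame  # peripheral streams keep their low quality
    return cv2.resize(frame, (1920, 1080), interpolation=cv2.INTER_CUBIC)
```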
At step 608, the plurality of video streams are displayed by the display device.
"Gaze based video stream processing".Microsoft patentThe application was originally filed in August 2023 and published by the USPTO a few days ago.
Generally speaking, a U.S. patent application is automatically published 18 months after its filing or priority date, or earlier at the applicant's request. Note that publication of a patent application does not mean the patent has been granted: after filing, the application must still undergo substantive examination by the USPTO, which can take anywhere from one to three years.