Qualcomm Patent Introduces Scene Segmentation and Object Tracking Method for Joint Scene Segmentation and One-Time Long-Term Object Tracking

Source: XR Navigation Network

(XR Navigation Network, December 19, 2023) Extended reality devices can use cameras to detect, track, and recognize target events or objects. Object detection can be used to detect or identify objects in an image or frame, and object tracking can then be performed to follow detected objects over time. In addition, image segmentation can be performed to detect multiple segments in a frame, which can then be analyzed and processed to carry out a desired image processing task or produce a desired image effect.

Existing scene segmentation and/or object tracking solutions have certain limitations. For example, one object tracking approach performs a local search in the region around the target object's position in the previous frame. However, such a local search relies on a prediction based on the previous frame, which may be suitable for short-term tracking but cannot re-detect the target object over a large number of frames (for example, more than 5 frames).

Another object tracking solution performs detection and tracking by searching for a target object based on object detection results. One problem with this approach is its reliance on object detection: detection is limited to a set of predefined classes, so any object outside those classes cannot be detected, and object detection itself is a complex process that requires significant computational resources.

Another solution involves the use of separate segmentation and tracking models, but this can have a high computational cost due to the use of two models.

Joint scene segmentation and one-time long-term target tracking is therefore an attractive solution: it allows a system to track one or more target objects across multiple frames while reusing the features computed for scene segmentation, enabling real-time video segmentation and tracking.

In a patent application titled "Scene segmentation and object tracking," Qualcomm describes a scene segmentation and object tracking method for joint scene segmentation and one-time long-term object tracking.

For one-time long-term object tracking and segmentation, the scene segmentation and object tracking system can perform a learning session to learn new objects defined in the initial frame. The system can then perform long-term tracking, and target segmentation can separate the tracked target object from the background.

Qualcomm said the described technology provides an efficient multitasking solution for semantic segmentation and long-term one-time object tracking and segmentation using semantic scene features, resulting in a system with low computational cost.

FIG. 1 is a block diagram illustrating an architecture of a scene segmentation and object tracking system 100. The scene segmentation and object tracking system 100 includes various components for tracking objects in a sequence of video frames, which may include one or more input frames 101. As shown, the components of the scene segmentation and object tracking system 100 include a semantic extraction engine 102, a feature memory 104, a cross-attention engine 106, and a prediction engine 108.

The scene segmentation and object tracking system 100 may process a sequence of frames of a scene in which at least one target object is located. The sequence of frames may include one or more input frames 101.
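
To make the data flow concrete, below is a minimal sketch, not Qualcomm's implementation, of how the four components of FIG. 1 might be wired together in PyTorch; the class name, method names, and module interfaces are illustrative assumptions.

```python
import torch.nn as nn

class SceneSegTracker(nn.Module):
    """Illustrative skeleton of the pipeline in FIG. 1 (assumed interfaces)."""
    def __init__(self, semantic_extractor, cross_attention, predictor):
        super().__init__()
        self.semantic_extractor = semantic_extractor  # semantic extraction engine 102
        self.cross_attention = cross_attention        # cross-attention engine 106
        self.predictor = predictor                    # prediction engine 108
        self.feature_memory = None                    # feature memory 104

    def initialize(self, first_frame, fg_bg_mask):
        # One-time learning: extract features from the initial frame, guided
        # by the foreground-background mask, and store them in the memory.
        seg_mask, features = self.semantic_extractor(first_frame, fg_bg_mask)
        self.feature_memory = features
        return seg_mask

    def track(self, frame):
        # Subsequent frames need no mask input: extract features, compare them
        # against the stored memory with cross-attention, then predict.
        seg_mask, query_features = self.semantic_extractor(frame, None)
        combined = self.cross_attention(self.feature_memory, query_features)
        prediction = self.predictor(combined)  # e.g. bounding box + fg/bg mask
        return seg_mask, prediction
```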

In one embodiment, the sequence of frames may be captured by one or more image capture devices of the scene segmentation and object tracking system 100. In an illustrative example, the scene segmentation and object tracking system 100 may include one RGB camera or multiple RGB cameras. In another illustrative example, the scene segmentation and object tracking system 100 may include one or more IR cameras and one or more RGB cameras.

The semantic extraction engine 102 may process frames from the one or more input frames 101 to generate one or more segmentation masks 103 for those frames. The one or more segmentation masks 103 may include a respective categorization of each pixel in the frames. For example, if a frame contains a person, a tree, grass, and sky, the semantic extraction engine 102 may categorize a particular pixel into a "person" class, a "tree" class, a "grass" class, or a "sky" class.

In generating the one or more segmentation masks 103, the semantic extraction engine 102 may generate multiple features. For example, the semantic extraction engine 102 may include or be part of a neural network trained to perform semantic segmentation.

Said neural network may include at least one hidden layer, the hidden layers generating one or more feature vectors or other feature representations to represent features from each of said one or more frames 101. Each hidden layer of the neural network may generate one or more feature vectors from inputs provided by a previous hidden layer. The semantic extraction engine 102 may extract or output features from the one or more hidden layers to the feature memory 104.
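
As an illustration of how a single segmentation network can yield both a per-pixel mask and hidden-layer features, the sketch below uses an off-the-shelf torchvision model as a stand-in for the semantic extraction engine 102; the model choice and the hooked layer are assumptions, not the patent's design.

```python
import torch
import torchvision

# Stand-in semantic segmentation network (assumed, not the patent's backbone).
model = torchvision.models.segmentation.deeplabv3_resnet50(weights=None).eval()

captured = {}
def save_features(module, inputs, output):
    # Keep the intermediate feature map produced by this hidden layer.
    captured["layer3"] = output

# Extract features from one hidden layer of the backbone via a forward hook.
model.backbone.layer3.register_forward_hook(save_features)

frame = torch.randn(1, 3, 480, 640)        # one RGB input frame
with torch.no_grad():
    logits = model(frame)["out"]           # per-class scores, shape (1, C, 480, 640)

seg_mask = logits.argmax(dim=1)            # per-pixel class labels, shape (1, 480, 640)
features = captured["layer3"]              # features that could feed the feature memory
```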

In one embodiment, the output of the semantic extraction engine 102 is provided to a feature memory 104. The feature memory 104 may store features extracted by the semantic extraction engine 102 for each of the one or more input frames 101. The features stored in the feature memory 104 may represent the foreground and the background.

As the semantic extraction engine 102 processes each new input frame, the feature memory 104 or the processing device of the system 100 may update the features stored in the feature memory 104 based on the features extracted from the new input frame.

In one embodiment, the feature memory 104 or the processing device may pause updating the feature memory 104 when the prediction confidence from the prediction engine 108 is less than a confidence threshold.

In one embodiment, the feature memory 104 or the processing device may select features with uncertainty less than an uncertainty threshold to update the feature memory 104. Features extracted from the initial frame are retained in the feature memory 104.
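
A hedged sketch of this update policy follows; the threshold values, the uncertainty measure, and the append-only memory layout are illustrative assumptions.

```python
import torch

CONFIDENCE_THRESHOLD = 0.5    # assumed value
UNCERTAINTY_THRESHOLD = 0.2   # assumed value

def update_feature_memory(memory, new_features, prediction_confidence, uncertainty):
    """memory, new_features: (C, N) feature tensors; uncertainty: per-feature scores.

    Features from the initial frame sit at the front of `memory` and are never
    removed by this append-only scheme.
    """
    # Pause the update entirely when the tracker's prediction is not confident.
    if prediction_confidence < CONFIDENCE_THRESHOLD:
        return memory

    # Keep only the newly extracted features whose uncertainty is low enough.
    keep = uncertainty < UNCERTAINTY_THRESHOLD
    selected = new_features[:, keep]

    # Append the selected features after the existing (initial-frame) entries.
    return torch.cat([memory, selected], dim=1)
```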

For example, for an initial frame of the one or more input frames 101, the scene segmentation and object tracking system 100 may use a foreground-background mask 105 to learn the foreground and background of that frame. The foreground-background mask 105 may be a binary mask in which pixels have a first value representing the target object in the corresponding input frame and a second value representing the background in the corresponding input frame.

The semantic extraction engine 102 may use the foreground-background mask 105 to guide the learning of the foreground and background of the corresponding input frame. In one embodiment, the foreground-background mask 105 is used only for the initial frame, to initialize the features stored in the feature memory 104 for the frame sequence. The scene segmentation and object tracking system 100 may process subsequent input frames in the frame sequence without using the foreground-background mask as an input.
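
For illustration only, a binary foreground-background mask of this kind could be constructed as follows; the frame size matches the example in FIG. 2, while the target region is an arbitrary assumption.

```python
import torch

H, W = 480, 640
fg_bg_mask = torch.zeros(1, H, W)        # second value (0) marks background pixels
y0, y1, x0, x1 = 120, 360, 200, 440      # assumed target-object region in the initial frame
fg_bg_mask[:, y0:y1, x0:x1] = 1.0        # first value (1) marks target-object pixels

# The mask accompanies only the initial frame; subsequent frames are
# processed without it.
```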

During tracking, the semantic extraction engine 102 may generate one or more segmentation masks 103 and extract features that may be used by the scene segmentation and object tracking system 100 to track at least one target object in the sequence of frames. For example, the semantic extraction engine 102 may output features extracted from the current input frame to the cross-attention engine 106.

As described above, features represent foreground and background. The cross-attention engine 106 may obtain the stored features from the feature memory 104. The cross-attention engine 106 may compare the stored features with the features extracted from the current input frame to generate a combined representation of the foreground and background representing the current frame.

The cross-attention engine 106 may output said combined representation or feature to the prediction engine 108. The prediction engine 108 determines or predicts the position of the target object in the current input frame based on the representation of the foreground and the background of the current input frame in the combined representation or feature, thereby generating the prediction result 107.

In one embodiment, the prediction engine 108 may generate a bounding box indicating the location of the target object. The prediction engine 108 may also generate a foreground-background mask. The foreground-background mask may be used to update the feature memory 104 based on features extracted from the current frame.
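
One plausible way to derive such a bounding box from a predicted foreground-background mask is sketched below; this is an illustrative helper, not necessarily the patent's method.

```python
import torch

def mask_to_bbox(fg_mask):
    """fg_mask: (H, W) binary tensor where 1 marks target-object pixels."""
    ys, xs = torch.nonzero(fg_mask, as_tuple=True)
    if ys.numel() == 0:
        return None  # the target is not visible in this frame
    # (x_min, y_min, x_max, y_max) bounding box around the foreground pixels
    return (xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item())
```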

FIG. 2 is a block diagram illustrating an example scene segmentation and object tracking system 200. The scene segmentation and object tracking system 200 is an example embodiment of the scene segmentation and object tracking system 100, configured to initialize the system for performing object tracking on one or more target objects captured in one or more frames of a scene.

For example, the scene segmentation and object tracking system 200 may use an initialization frame 201 to initialize the scene segmentation and object tracking system for tracking. Said initialization frame 201 may include an initial frame of a sequence of frames of said scene.

In an illustrative example, the initialization frame 201 may have a resolution of 480 pixels by 640 pixels with 3 color channels, such as 3 x 480 x 640 shown in FIG. 2. The initialization frame 201 may have any other suitable resolution or number of color channels.

The semantic extraction engine 202 includes a semantic backbone 210 for performing semantic segmentation and feature extraction. The semantic backbone 210 may include a machine learning system, such as a neural network trained to perform semantic segmentation. For example, the semantic backbone 210 may be implemented as an encoder-decoder neural network trained using a plurality of training frames or images based on supervised learning, semi-supervised learning, or unsupervised learning.

The semantic backbone 210 may process the initialization frame 201 to generate one or more segmentation masks 203. Similar to the one or more segmentation masks 103, the one or more segmentation masks 203 may include a corresponding classification for each pixel in the initialization frame 201.

As shown in FIG. 2, the initialization frame 201 includes a person in the middle of the frame and other background information. The person's body is categorized into a first category, the person's face into a second category, the leaves in frame 201 into a third category, and the building in frame 201 into a fourth category, as shown by the different patterns in the one or more segmentation masks 203.

The one or more segmentation masks 203 may be any suitable size. In the example shown in FIG. 2, the one or more segmentation masks 203 have a resolution of 480 x 640, matching the resolution of the initialization frame 201, and a depth of M, corresponding to the number of semantic categories, resulting in an M x 480 x 640 segmentation mask.

In one example, M may equal 4, corresponding to the "person", "face", "leaf", and "building" categories above. In such an example, the semantic backbone 210 may be trained to classify more than 4 semantic categories, but none of the objects in the initialization frame 201 corresponds to those additional categories.

The one or more segmentation masks 203 may be used for one or more processes in a system that includes the scene segmentation and object tracking system 200 or is separate from the scene segmentation and object tracking system 200. For example, a camera system may use a segmentation mask to process frames, such as processing different portions of an image in different ways.

The semantic backbone 210 generates a plurality of features as part of the process of generating one or more segmentation masks 203. For example, in the example neural network implementation for semantic backbone 210, each hidden layer of the neural network may output one or more feature maps with a particular resolution at a particular depth.

In the example encoder-decoder neural network, each subsequent hidden layer of the encoder portion may output a feature map with a smaller resolution than the previous hidden layer, while each subsequent hidden layer of the decoder portion may output a feature map with a larger resolution than the previous hidden layer.

In one embodiment, the scene segmentation and object tracking system 200 may be used to track multiple objects. In this multi-object tracking scenario, all target objects may share the same extracted features, so that the semantic backbone 210 is run only once for the initialization frame 201.

The semantic extraction engine 202 may extract features from the semantic backbone 210 and output the extracted features to the fusion engine 212 of the semantic extraction engine 202. For example, each of the hidden layers of the neural network comprising the semantic backbone 210 may output feature maps with different resolutions or scales. The semantic backbone 210 may extract feature maps from one or more hidden layers and output those feature maps to the fusion engine 212.

The semantic extraction engine 202 may output fusion features 213 to a mask embedding engine 214. The mask embedding engine 214 may use the fusion features 213 and the foreground-background mask 205 to learn the foreground and background of the initialization frame 201.

In one embodiment, the mask embedding engine 214 may use the foreground-background mask 205 to direct the learning of the features of the fusion feature 213 corresponding to the foreground and the features of the fusion feature 213 corresponding to the background of the initialization frame 201, thereby providing one-time learning of both the foreground and the background of the scene depicted in the initialization frame 201.

For example, the mask embedding engine 214 may combine the fusion feature 213 generated for the initialization frame 201 with the foreground-background mask 205 to generate the modified feature 215. In an illustrative example, to combine the fusion feature 213 with the foreground-background mask 205, the mask embedding engine 214 may use a convolutional layer to embed the foreground-background mask 205 into a feature of a particular size (denoted x) and may add the feature x to the fusion feature 213.

The sum of the feature x and the fusion feature 213 can be output to another convolutional layer to generate the modified feature 215. The modified feature 215 can have the same size and depth as the fusion feature 213, as shown in FIG. 2, or can have a different resolution and/or depth.
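
A minimal sketch of this mask-embedding step, assuming particular channel counts and kernel sizes, could look as follows: one convolution embeds the mask into the feature x, x is added to the fusion feature, and a second convolution produces the modified feature.

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskEmbedding(nn.Module):
    def __init__(self, feat_channels=256):   # assumed fusion-feature depth
        super().__init__()
        # Convolution that embeds the 1-channel fg/bg mask into the feature "x".
        self.mask_conv = nn.Conv2d(1, feat_channels, kernel_size=3, padding=1)
        # Second convolution applied to the sum to produce the modified feature.
        self.merge_conv = nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1)

    def forward(self, fusion_feature, fg_bg_mask):
        # fusion_feature: (B, C, H, W); fg_bg_mask: (B, 1, H0, W0)
        mask = F.interpolate(fg_bg_mask, size=fusion_feature.shape[-2:])
        x = self.mask_conv(mask)                        # embedded mask feature "x"
        modified = self.merge_conv(fusion_feature + x)  # modified feature
        return modified
```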

In one embodiment, the foreground-background mask may be used only with the initialization frame 201, to initialize the features stored in the feature memory 204 for the frame sequence. In this regard, the scene segmentation and object tracking system 200 may process subsequent input frames in the frame sequence without needing to use the foreground-background mask as an input for performing tracking.

The modified features 215 may be output to a feature embedding engine 216, which may further embed the modified features 215 as key-value pairs 217 to initialize the feature memory 204. As shown in FIG. 2, the key-value pairs 217 may comprise two feature maps, each with a dimension of 30 x 40 and a depth of 64 channels. The key-value pairs 217 may be reshaped into a dimension of 64 x N (x2), where 64 is the number of channels, N corresponds to the number of key-value pairs stored in the feature memory 204, and x2 indicates that there are two 64 x N tensors or vectors.

In this example, there are N keys and N values, each of which is a 64-dimensional tensor or vector. The key-value pairs initially stored in the feature memory 204 may represent the foreground and the background of the initialization frame 201.
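
A sketch of this key/value embedding and reshaping is shown below; the 1x1 projection layers and the input depth of 256 are assumptions, while the 64-channel, 30 x 40 shapes follow the example in the text (N = 30 x 40 = 1200).

```python
import torch
import torch.nn as nn

key_proj = nn.Conv2d(256, 64, kernel_size=1)    # assumed input depth of 256
value_proj = nn.Conv2d(256, 64, kernel_size=1)

modified_feature = torch.randn(1, 256, 30, 40)  # modified feature (assumed shape)

keys = key_proj(modified_feature)               # (1, 64, 30, 40)
values = value_proj(modified_feature)           # (1, 64, 30, 40)

# Flatten the spatial dimensions: every spatial location contributes one
# 64-dimensional key and one 64-dimensional value (N = 30 * 40 = 1200).
keys = keys.flatten(2).squeeze(0)               # (64, 1200)
values = values.flatten(2).squeeze(0)           # (64, 1200)
```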

As each subsequent frame of the frame sequence is processed by the semantic extraction engine 202, the features stored in the feature memory 204 may be updated based on the newly extracted features.

FIG. 3 is a block diagram illustrating an example scene segmentation and object tracking system 300, which is used to perform predictions for object tracking of one or more target objects in one or more frames of the frame sequence following the initialization frame 201 described above with respect to FIG. 2.

The scene segmentation and object tracking system 300 may perform predictions for object tracking based on a global search performed through cross-attention.

In one embodiment, the scene segmentation and object tracking system 300 is an example embodiment of the scene segmentation and object tracking system 100 when performing tracking of one or more target objects in one or more query frames of the frame sequence.

During tracking of the one or more target objects, the semantic backbone 210 may process the current query frame 301 and generate one or more segmentation masks 303 using the techniques described above with respect to FIG. 2. The fusion engine 212 may generate fusion features 313 based on the features extracted from the semantic backbone 210. The fusion features 313 may be output to the feature embedding engine 216, which may generate key-value pairs 317 comprising two feature maps.

The cross-attention engine 306 of the scene segmentation and object tracking system 300 may process the key-value pairs from the feature memory 204 and the key-value pairs 317 output by the feature embedding engine 216 for the current query frame 301. For example, the cross-attention engine 306 may compare the stored key-value pairs from the feature memory 204 with the key-value pairs 317 extracted from the current query frame to generate a combined representation of the foreground and background of the current query frame 301.
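
A hedged sketch of this cross-attention read is shown below, with the memory and query-frame key-value pairs in the flattened 64 x N form described earlier; the softmax scaling and the way the read values are combined with the query values are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_attention(mem_keys, mem_values, query_keys, query_values):
    """mem_keys, mem_values: (64, N_mem); query_keys, query_values: (64, N_q)."""
    d = mem_keys.shape[0]
    # Similarity between every query-frame location and every memory entry.
    attn = F.softmax(query_keys.t() @ mem_keys / d ** 0.5, dim=-1)  # (N_q, N_mem)
    # Read out the stored (foreground/background) values for each query location.
    read = (attn @ mem_values.t()).t()                              # (64, N_q)
    # Combined representation of the current query frame's foreground and background.
    return torch.cat([read, query_values], dim=0)                   # (128, N_q)
```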

As described above, the scene segmentation and object tracking systems and techniques described in the invention can perform object tracking of one or more target objects across a sequence of frames by leveraging the features determined during semantic scene segmentation, thereby providing computationally efficient joint scene segmentation and object tracking.

For example, because the features used for multi-object tracking are shared across all tracked target objects, the increase in latency is minimal as more target objects are tracked. In addition, the semantic segmentation model is an integrated module of the tracker, in which case enabling the tracking function adds only a small amount of latency.

The Qualcomm patent application titled "Scene segmentation and object tracking" was originally filed in May 2022 and was recently published by the U.S. Patent and Trademark Office.

Generally speaking, a U.S. patent application is automatically published 18 months after its filing date or priority date, or earlier at the applicant's request. Note that publication of a patent application does not mean the patent has been granted. After a patent application is filed, the USPTO still conducts substantive examination, which can take anywhere from one to three years.
