CMU, MIT researchers use explicit volume representation to design SLAM solutions

(XR Navigation Network, March 13, 2024) Over the past three decades, simultaneous localization and mapping (SLAM) research has revolved extensively around the question of map representation, producing a variety of sparse, dense, and neural scene representations. This fundamental choice of map representation strongly affects the design of every processing module in a SLAM system, as well as the downstream tasks that depend on SLAM outputs.

Although systems based on handcrafted map representations have matured to production level over the past several years, they still have significant shortcomings. Tracking with explicit representations critically relies on the availability of rich 3D geometric features and high-frame-rate captures. Moreover, these methods can only reason about the observed portion of the scene, whereas many applications such as mixed reality and high-fidelity 3D capture require techniques that can render unobserved/novel camera viewpoints.

The shortcomings of handcrafted representations, coupled with the emergence of radiance field representations for high-quality image synthesis, have driven a class of approaches that encode the scene into the weight space of a neural network. Such radiance-field-based SLAM algorithms benefit from high-fidelity global maps and image reconstruction losses, and they can capture dense photometric information through differentiable rendering. However, current approaches use implicit neural representations to model the volumetric radiance field, which causes numerous problems in the SLAM setting: they are computationally inefficient, not easily editable, and unable to explicitly model the spatial geometry of the scene.

Researchers at Carnegie Mellon University and MIT therefore explored the question: how can a SLAM solution be designed around an explicit volumetric representation?

Specifically, the researchers used a radiance field based on 3D Gaussians to Splat (render), Track, and Map. They believe this representation has the following advantages over existing map representations:

  • Fast rendering and rich optimization: Gaussian Splatting can be rendered at up to 400 FPS, allowing for faster visualization and optimization than implicit alternatives. A key factor in this fast optimization is the rasterization of 3D primitives. The team introduced several simple modifications that make splatting even faster for SLAM, including removing view-dependent appearance and using isotropic Gaussians. This also allowed the team to perform SLAM in real time using a dense photometric loss, whereas traditional and implicit map representations rely on sparse 3D geometric features or pixel sampling to maintain efficiency.

  • Maps with a well-defined spatial extent: the spatial extent of the existing map can be easily controlled by adding Gaussians only to parts of the scene that have been observed so far. Given a new image frame, this makes it possible to efficiently identify which parts of the scene are new content by rendering a silhouette. This is crucial for camera tracking, since the researchers only want to compare the already-mapped regions of the scene against the new image. Doing this with implicit map representations is very difficult, because the network is subject to global changes over unmapped space during gradient-based optimization.

  • Explicit map: the map capacity can be increased arbitrarily by simply adding more Gaussians. In addition, this explicit volumetric representation makes it possible to edit parts of the scene while still allowing photorealistic rendering. Implicit approaches cannot easily increase their capacity or edit the scene they represent.

  • Direct gradient flow: since the scene is represented by Gaussians with a physical 3D position, color, and size, there is a direct, almost linear (projective) gradient flow between the parameters and the rendering. Because camera motion can be thought of as keeping the camera stationary and moving the scene instead, gradients can likewise flow directly to the camera parameters, which enables fast pose optimization (see the sketch below). Neural-network-based representations do not enjoy this property, since their gradients must flow through (potentially many) non-linear network layers.
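
The following toy example, a minimal PyTorch sketch and not the authors' code, illustrates this last property: the camera pose is the only variable being optimized, and the gradient of a dense rendering-style loss flows directly to it. The "renderer" here simply shifts stand-in map points into the camera frame and projects them; the real system splats full RGB-D images.

```python
# Toy illustration of direct gradient flow to camera parameters.
# Assumptions: translation-only camera motion, random points standing in
# for Gaussian centers, and a pinhole-style "render" returning pixel
# coordinates plus depth. Illustrative only, not the authors' pipeline.
import torch

torch.manual_seed(0)
points = torch.randn(200, 3) + torch.tensor([0.0, 0.0, 5.0])  # stand-in map, in front of the camera
true_t = torch.tensor([0.2, -0.1, 0.3])                       # ground-truth camera translation

def render(pts, t):
    """Transform points into the camera frame and return (u, v, depth)."""
    p = pts - t
    return torch.stack([p[:, 0] / p[:, 2], p[:, 1] / p[:, 2], p[:, 2]], dim=1)

observed = render(points, true_t)            # the "captured" frame
t = torch.zeros(3, requires_grad=True)       # pose initialized at the previous estimate
optimizer = torch.optim.Adam([t], lr=1e-2)

for _ in range(300):
    optimizer.zero_grad()
    loss = torch.abs(render(points, t) - observed).mean()  # dense L1 rendering loss
    loss.backward()                                         # gradient flows straight to the pose
    optimizer.step()

print(t.detach())  # approaches true_t
```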

Given all of the above advantages, an explicit volumetric representation is a natural way to efficiently infer a high-fidelity spatial map while simultaneously estimating camera motion. Experiments on simulated and real data show that the team's method, SplaTAM, achieves state-of-the-art results for camera pose estimation, map estimation, and novel view synthesis compared to all previous methods.

SplaTAM is the first dense RGB-D SLAM solution to use 3D Gaussian Splatting, the team said. By modeling the world as a collection of 3D Gaussians that can be rendered into high-fidelity color and depth images, the system can directly use differentiable rendering and gradient-based optimization to optimize the camera pose of each frame together with the underlying volumetric map of the world.

The researchers represent the underlying map of the scene as a set of 3D Gaussians, simplified by using only view-independent color and by forcing the Gaussians to be isotropic. As a result, each Gaussian is described by only eight parameter values.
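
A minimal sketch of such a primitive is shown below. The exact split of the eight parameters is an assumption here (three for the center, three for RGB color, one radius, one opacity), and the data structure is illustrative rather than the authors' code.

```python
# An illustrative parameterization of one simplified map primitive:
# an isotropic 3D Gaussian with view-independent color. The 8 values
# assumed here are center (3), RGB color (3), radius (1), opacity (1).
from dataclasses import dataclass
import numpy as np

@dataclass
class IsotropicGaussian:
    center: np.ndarray   # (3,) 3D position in world coordinates
    color: np.ndarray    # (3,) view-independent RGB color
    radius: float        # single scale -> spherical (isotropic) Gaussian
    opacity: float       # weight used during alpha compositing

    def falloff(self, x: np.ndarray) -> float:
        """Unnormalized Gaussian influence of this primitive at 3D point x."""
        d2 = float(np.sum((x - self.center) ** 2))
        return self.opacity * np.exp(-0.5 * d2 / (self.radius ** 2))
```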

The core of the described method is the ability to render high-fidelity color, depth, and silhouette images from the underlying Gaussian map into any possible camera reference frame in a differentiable manner. This differentiable rendering makes it possible to compute the gradient of the error between the rendered and the captured RGB-D frame with respect to both the underlying scene representation and the camera parameters, and to update the Gaussians and the camera parameters so as to minimize that error, yielding accurate camera poses and a precise volumetric representation of the world.

Gaussian Splatting renders an RGB image as follows: given a set of 3D Gaussians and a camera pose, all Gaussians are first sorted from front to back. The RGB image can then be rendered efficiently by alpha-compositing the splatted 2D projection of each Gaussian, in order, in pixel space.
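
The per-pixel compositing step can be sketched as follows (a simplified illustration, not the actual rasterizer; the footprint weight stands for the splatted 2D Gaussian evaluated at the pixel). Depth and silhouette images can be composited analogously by swapping the color for the Gaussian's depth or for a constant one.

```python
# Simplified front-to-back alpha compositing for a single pixel.
# `splats` is assumed to be a list of (color, footprint_weight, opacity)
# tuples already sorted by increasing depth. Illustrative only.
import numpy as np

def composite_pixel(splats):
    pixel = np.zeros(3)
    transmittance = 1.0                      # light not yet absorbed by closer Gaussians
    for color, weight, opacity in splats:
        alpha = opacity * weight             # effective contribution of this Gaussian
        pixel += transmittance * alpha * np.asarray(color, dtype=float)
        transmittance *= (1.0 - alpha)
        if transmittance < 1e-4:             # early exit once the pixel is saturated
            break
    return pixel

# Example: a red Gaussian in front of a blue one.
print(composite_pixel([((1, 0, 0), 0.8, 0.9), ((0, 0, 1), 1.0, 1.0)]))
```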

The researchers built a SLAM system from this Gaussian representation and its differentiable renderer. Assume an existing map has been fitted from camera frames 1 to t. Given a new RGB-D frame t + 1, the SLAM system performs the steps shown in Figure 2:

  1. Camera tracking: minimize the image and depth reconstruction error of the RGB-D frame with respect to the camera pose parameters of frame t + 1, evaluating the error only for pixels inside the visible silhouette (a sketch of this masked loss follows the list).

  2. Gaussian densification: add new Gaussians to the map based on the rendered silhouette and the input depth.

  3. Map update: given the camera poses of frames 1 to t + 1, update the parameters of all Gaussians in the scene by minimizing the RGB and depth errors over all images up to t + 1. In practice, a selected subset of keyframes that overlap with the most recent frame is optimized in order to keep the batch size manageable.
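
A minimal sketch of the silhouette-masked tracking loss mentioned in step 1 might look as follows. It assumes the renderer returns full-resolution color, depth, and silhouette tensors; the equal weighting of the two terms is an assumption for illustration, not taken from the paper.

```python
# Silhouette-masked L1 tracking loss (illustrative sketch).
# Assumes (H, W, 3) rendered/target color and (H, W) depth and silhouette
# tensors; only pixels that the map already explains (silhouette above a
# threshold) and that have a valid measured depth contribute to the loss.
import torch

def tracking_loss(rendered_rgb, rendered_depth, silhouette,
                  target_rgb, target_depth, threshold=0.99):
    mask = (silhouette > threshold) & (target_depth > 0)
    color_err = torch.abs(rendered_rgb - target_rgb).sum(dim=-1)[mask].mean()
    depth_err = torch.abs(rendered_depth - target_depth)[mask].mean()
    return color_err + depth_err   # relative weighting is an assumption
```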

The researchers evaluated the proposed method on four datasets: ScanNet++, Replica, TUM-RGBD, and the original ScanNet. The latter three were chosen to follow the evaluation protocol of the previous radiance-field-based SLAM methods Point-SLAM and NICE-SLAM. ScanNet++ was added because none of the other three benchmarks can evaluate the rendering quality of novel views, only camera pose estimation and the rendering of training views.

Replica is the easiest benchmark because it contains synthetic scenes, highly accurate and complete (synthetic) depth maps, and small displacements between successive camera poses. TUM-RGBD and the original ScanNet are more difficult, especially for dense methods, because both were captured with older, low-quality cameras, so the RGB and depth images are of poor quality: the depth images are very sparse with a lot of missing information, while the color images exhibit very high motion blur.

For ScanNet++, the team used DSLR captures of two scenes (8b5caf3398 (S1) and b20a261fdf (S2)) containing full dense trajectories. The color and depth images of ScanNet++ are of very high quality compared to the other benchmarks, and a second capture loop is provided for each scene for evaluating completely novel views.

In Table 1, the team compares the camera pose estimation results of the proposed method with a series of baselines across the four datasets. On ScanNet++, the SOTA SLAM methods Point-SLAM and ORB-SLAM3 are completely unable to track the camera correctly due to the very large displacements between neighboring cameras, resulting in very large pose estimation errors. In particular, for ORB-SLAM3 it was observed that the textureless ScanNet++ scans caused tracking to re-initialize multiple times due to a lack of features. In contrast, the team's approach successfully tracks the camera in both sequences with an average trajectory error of only 1.2 cm.

On the Replica dataset, the previous de facto evaluation benchmark, the proposed method reduces the trajectory error of the previous SOTA from 0.52 cm to 0.36 cm, a reduction of more than 30%.

On TUM-RGBD, all volumetric methods struggle badly due to the poor depth sensor information (very sparse) and the poor RGB image quality (extreme motion blur). Compared to previous methods in this category, however, the proposed method still significantly outperforms the previous SOTA, reducing the trajectory error by nearly 40%, from 8.92 cm to 5.48 cm. On this benchmark, though, feature-based sparse tracking methods such as ORB-SLAM2 still outperform dense methods.

The original ScanNet benchmark has issues similar to TUM-RGBD, so no dense volumetric method has been able to achieve a trajectory error below 10 cm. On this benchmark, the performance of the proposed method is similar to that of the two previous SOTA methods.

Overall, the above camera pose estimation results are very promising and demonstrate the benefits of the SplaTAM approach. The ScanNet++ results show that, given high-quality, clean input images, the team's approach can perform SLAM successfully and accurately even when there is large motion between camera positions, something that was not possible with previous SOTA approaches.

In Figure 3, the team visualizes the Gaussian maps reconstructed for the two ScanNet++ sequences. As can be seen, these reconstructions are of remarkably high quality in terms of both geometry and visual appearance. This is one of the main advantages of using a map representation based on 3D Gaussian Splatting.

The figure also shows the camera trajectory estimated by the proposed method and the camera pose frustums overlaid on the map. The large displacements that frequently occur between successive camera poses, which make this a very difficult SLAM benchmark, are clearly visible, yet the method handles them very accurately.

As shown in Table 2, the team evaluated the rendering quality of the proposed method on the input views of the Replica dataset. Their method obtains PSNR, SSIM, and LPIPS results similar to Point-SLAM, but this comparison is not entirely fair because Point-SLAM has the advantage of taking the ground-truth depth of each image as input in order to decide where to sample its 3D volume for rendering.

The team's method achieves better results than the other baselines, Vox-Fusion and NICE-SLAM, improving PSNR by about 10 dB over both.

On the ScanNet++ benchmark, the results for novel views and training views are shown in Table 3. The proposed method obtains good novel-view synthesis results, with an average PSNR of 24.41, while the PSNR for training views is slightly higher at 27.98. At the same time, the method produces remarkably accurate reconstructions, with a depth error of only 2 cm for novel views and 1.3 cm for training views. The excellent reconstruction quality can also be seen in Figure 3.

Visual results of the RGB and depth renderings for novel and training views are shown in Figure 4. The proposed method achieves good results on both scenes, for both novel and training views. In contrast, Point-SLAM fails at camera pose tracking, overfits the training views, and is not able to render novel views at all. Point-SLAM also uses the ground-truth depth as an input to rendering in order to decide where to sample, so its depth maps look similar to the ground truth while its color renderings are completely wrong.

SplaTAM fits the camera poses and the scene map using both a photometric (RGB) loss and a depth loss. In Table 5, the team ablates this decision and investigates the performance of using only one or the other for tracking and mapping, using Replica's Room 0 for the study.

Using only depth, the method is completely unable to track the camera trajectory because the L1 depth loss does not provide enough information in the x-y image plane. Using only the RGB loss, the camera trajectory can be tracked successfully, but with more than 5× the error of using both; RGB and depth together achieve very good results. When only the color loss is used, the PSNR of the reconstruction remains very high, only 1.5 dB lower than the full model, but the depth L1 error is much higher than when the depth error is optimized directly. Note that in the color-only experiments, depth is not used as a loss for tracking or mapping, but it is still used for Gaussian densification and initialization.

In Table 4, three aspects of camera tracking are ablated: using forward velocity propagation; using a silhouette mask to exclude not-yet-mapped regions from the loss; and setting the silhouette threshold to 0.99 instead of 0.5. All three are critical to achieving good results. Tracking without forward velocity propagation still works, but the overall error is more than 10× larger. The silhouette mask is essential: without it, tracking fails completely. Setting the silhouette threshold to 0.99 ensures that the loss is applied only to pixels that are well optimized in the map, reducing the error by 5× compared with the 0.5 threshold used for densification.
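
Forward velocity propagation can be sketched as a constant-velocity pose initialization, as below. The 4×4 rigid-transform convention is an assumption for illustration and the helper is hypothetical, not taken from the authors' code.

```python
# Constant-velocity forward propagation of the camera pose (illustrative).
# Poses are assumed to be 4x4 rigid transforms; the new frame's pose is
# initialized by applying the most recent relative motion once more.
import numpy as np

def propagate_pose(pose_t, pose_t_minus_1):
    """Initial guess for frame t+1 given the poses of frames t and t-1."""
    relative = pose_t @ np.linalg.inv(pose_t_minus_1)   # motion from t-1 to t
    return relative @ pose_t                            # guess for frame t+1
```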

Table 6 compares runtimes (on an Nvidia RTX 3080 Ti). Each iteration of the proposed method renders a full 1200 × 980 pixel image (~1.2 million pixels), whereas the other methods use only 200 pixels per iteration for tracking and 1000 pixels per iteration for mapping (albeit with carefully chosen samples). Although the proposed method renders three orders of magnitude more pixels, the runtimes are similar, largely thanks to the efficiency of rasterizing 3D Gaussians.

In addition, the team evaluated a lighter variant, SplaTAM-S, which uses fewer iterations and half-resolution densification. It runs five times faster with only a slight decrease in performance. Specifically, on Replica, SplaTAM uses 40 tracking and 60 mapping iterations per frame, while SplaTAM-S uses 10 and 15, respectively.

Of course, the researchers acknowledge that, despite SplaTAM's state-of-the-art performance, the method shows some sensitivity to motion blur, large depth noise, and aggressive rotation. The team believes a possible solution is to temporally model these effects and hopes to address this in future research. In addition, the method requires known camera intrinsics and dense depth as inputs; removing these dependencies is an interesting direction for future work.

Overall, researchers at Carnegie Mellon University and MIT have proposed a novel SLAM system, SplaTAM, which leverages a 3D Gaussian Splatting radiance field as its underlying map representation and enables faster rendering and optimization, explicit knowledge of the map's spatial extent, and simplified map densification. Experiments demonstrate its effectiveness in achieving state-of-the-art results in camera pose estimation, scene reconstruction, and novel view synthesis.

The team believes that SplaTAM not only sets a new benchmark for SLAM and novel view synthesis, but also opens up exciting avenues in which the integration of 3D Gaussian Splatting with SLAM provides a powerful framework for further exploration and innovation in scene understanding. The research highlights the potential of this integration and paves the way for more sophisticated and efficient SLAM systems in the future.
