US-China research team creates XR text-to-3D scene evaluation metrics with GPT-4V technology - XR Navigator News

(XR Navigator Information) The fields of text-to-image, text-to-video, and text-to-3D generation are booming. It is easy to imagine these text-based generation technologies being combined with XR to quickly produce a wide variety of realistic scenes.

However, a technology in its early stages may suffer from a lack of reliable evaluation metrics. In research on text-to-3D generation, a team from The Chinese University of Hong Kong, Stanford University, Adobe, Nanyang Technological University, and the Shanghai Artificial Intelligence Laboratory found that GPT-4V can serve as an evaluation metric that aligns with human judgment and provides an efficient, comprehensive assessment of text-to-3D models.

Driven by a series of breakthroughs in neural 3D representations, large-scale dataset development, scalable generative models, and innovative applications of text-to-image models for 3D generation, the field of text-to-3D generation has made significant progress over the past year. Given this momentum, it is reasonable to expect research efforts and progress in text-to-3D generative modeling to increase rapidly.

However, the team argues that suitable evaluation metrics for text-to-3D models have not kept pace, and that this shortcoming may hinder further improvement of the generative models themselves. They note that existing metrics typically focus on a single criterion and lack generalizability across the many facets of 3D evaluation. For example, CLIP-based metrics are designed to measure how well a 3D asset is aligned with its input text, but they may not adequately assess geometric and texture details. This lack of flexibility leads to inconsistency with human judgment across evaluation criteria. As a result, many researchers fall back on user studies for accurate and comprehensive evaluation.

User studies, on the other hand, are adaptive and accurately reflect human judgment, but they are costly, hard to scale, and time-consuming. As a result, most user studies are limited to a very small set of input text prompts.

This raises the question: can we create automated metrics that are applicable to a wide range of evaluation criteria and closely aligned with human judgment?

Designing a metric that meets these criteria requires three core capabilities: generating input text prompts, understanding human intent, and reasoning about the 3D physical world. Fortunately, large multimodal models (LMMs), and in particular GPT-4Vision (GPT-4V), have shown considerable promise in meeting these requirements.

The team drew inspiration from the human ability to perform 3D reasoning tasks from 2D visual information under linguistic guidance, and hypothesized that GPT-4V can perform similar 3D model evaluation tasks.

In the paper titled "GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation," the researchers present a proof-of-concept demonstrating the use of GPT-4V to develop customizable, scalable, and human-aligned evaluation metrics for text-to-3D generation tasks.

Constructing such an evaluation metric is analogous to setting an exam, which requires two steps: writing the questions and grading the answers. To effectively evaluate text-to-3D models, it is critical to obtain a set of input prompts that accurately reflects the evaluator's needs.

Relying on static, heuristically generated prompts is not sufficient. Instead, the researchers developed a "meta-prompt" system in which GPT-4V generates a set of input prompts tailored to the focus of the evaluation, as sketched below. Once the input text prompts are generated, the method compares 3D shapes against user-defined criteria, much like grading an exam.
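
To make the meta-prompt idea concrete, here is a minimal sketch of how such a generator might be wired to the OpenAI chat API. The meta-prompt wording, the complexity/creativity knobs, and the `generate_eval_prompts` helper are illustrative assumptions rather than the paper's actual template, and `gpt-4o` stands in for whichever GPT-4V-class model is available.

```python
# Hypothetical sketch of a "meta-prompt" that asks a GPT-4V-class model
# to generate input prompts for evaluating text-to-3D models. The prompt
# wording and parameters are illustrative, not the paper's template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_eval_prompts(n: int, complexity: str, creativity: str) -> list[str]:
    """Ask the model for n text-to-3D input prompts at a target difficulty."""
    meta_prompt = (
        f"You are writing an exam for text-to-3D generative models. "
        f"Produce {n} text prompts describing 3D objects or scenes. "
        f"Target complexity: {complexity}. Target creativity: {creativity}. "
        f"Return one prompt per line with no numbering."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for a GPT-4V-class model
        messages=[{"role": "user", "content": meta_prompt}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]
```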

Creating evaluation metrics for text-to-3D models requires deciding which set of input text prompts to feed the models. Ideally one would use every possible user prompt, but that is computationally infeasible. Instead, the goal is to build a generator whose output prompts model the actual distribution of user input.

The goal of the evaluation metric is to rank a set of text-to-3D models according to user-defined criteria. The team's approach involves two main components. First, a decision must be made about which text prompts to use as input for the evaluation task. To this end, the researchers developed an automated prompt generator that produces text prompts with customizable levels of complexity and creativity. The second component is a versatile 3D asset comparator that compares pairs of 3D shapes generated from a given text prompt against the input evaluation criteria.

Together, these components allow the team to assign each model a ranking score using the Elo rating system.
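
For intuition, here is a minimal sketch of the Elo bookkeeping, assuming each pairwise comparison yields a score of 1.0 (first asset preferred), 0.0 (second preferred), or 0.5 (tie). The sequential update rule shown is the classic one; the paper's exact fitting procedure may differ, and the model names in the usage comment are purely illustrative.

```python
# Minimal Elo ranking from pairwise comparison outcomes.
# Each record is (model_a, model_b, score_a), where score_a is
# 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
from collections import defaultdict

def elo_ranking(comparisons, k=32.0, base_rating=1000.0):
    ratings = defaultdict(lambda: base_rating)
    for model_a, model_b, score_a in comparisons:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of A under the standard Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))

# Usage (illustrative model names):
# elo_ranking([("ModelA", "ModelB", 1.0), ("ModelB", "ModelC", 0.5)])
```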

The team created a customizable instruction template containing the information GPT-4V needs to perform pairwise 3D asset comparison tasks. They fill this template with the chosen evaluation criteria, rendered images of the 3D assets, and random seeds to form the final "3D-aware prompt" for GPT-4V, which then consumes these inputs and outputs its evaluation. Finally, they aggregate GPT-4V's sampled answers into a reliable final estimate for the task. A sketch of one such comparison follows.
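
The sketch below shows what one pairwise comparison could look like against the OpenAI chat API. The instruction text, the `compare_assets` helper, and the majority-vote aggregation are assumptions for illustration, not the paper's exact template; `gpt-4o` again stands in for a GPT-4V-class model, and the inputs are assumed to be base64-encoded multi-view renders of the two assets.

```python
# Hedged sketch of one pairwise 3D asset comparison: show the model
# renders of two assets plus a criterion, sample several answers,
# and aggregate by majority vote. Helper name and prompt wording are
# illustrative assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def compare_assets(images_a, images_b, text_prompt, criterion, samples=5):
    """Return 'A' or 'B' by majority vote over sampled model answers.
    images_a / images_b: base64-encoded PNG renders of each asset."""
    content = [{
        "type": "text",
        "text": (
            f"Two 3D assets were generated from the prompt: '{text_prompt}'. "
            f"The first {len(images_a)} images show asset A; the remaining "
            f"{len(images_b)} show asset B. Judging only by {criterion}, "
            f"which asset is better? Answer with a single letter, A or B."
        ),
    }]
    for img in images_a + images_b:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{img}"},
        })
    votes = Counter()
    for _ in range(samples):
        response = client.chat.completions.create(
            model="gpt-4o",  # stand-in for a GPT-4V-class model
            messages=[{"role": "user", "content": content}],
            temperature=1.0,  # sample varied answers, then aggregate
        )
        answer = response.choices[0].message.content.strip().upper()
        votes["A" if answer.startswith("A") else "B"] += 1
    return votes.most_common(1)[0][0]
```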

Preliminary empirical evidence suggests that the proposed framework outperforms existing metrics, achieving better alignment with human judgment across different evaluation criteria. The results indicate that the metric can provide an efficient and comprehensive assessment of text-to-3D models.

However, the team acknowledges that their approach still faces several unresolved challenges:

  • Due to limited resources, the paper's experiments and user studies were relatively small in scale. Scaling up the study is important for better testing the hypotheses.

  • GPT-4V's responses are not always correct. For example, GPT-4V sometimes hallucinates, a common problem with many large pre-trained models.

  • A good metric should be "non-spoofable". However, one could construct adversarial patterns to attack GPT-4V, obtaining a high score without actually generating a high-quality 3D asset.

  • While the paper's approach is more scalable than conducting user preference studies, it can run into computational limits such as GPT-4V API access restrictions. Because the method is pairwise, the number of comparisons grows quadratically with the number of models evaluated (N models require N(N-1)/2 comparisons, e.g. 45 comparisons for 10 models but 4,950 for 100), which may not scale well under limited compute. It would therefore be interesting to investigate how GPT-4V could improve efficiency by intelligently selecting input prompts.

Overall, the paper proposes a new framework that uses GPT-4V to build a customizable, scalable evaluation metric consistent with human judgment for text-to-3D generation tasks. First, the team proposes a prompt generator that produces input prompts matched to the evaluator's needs. Second, they prompt GPT-4V with a series of customizable "3D-aware prompts" so that GPT-4V can compare two 3D assets according to the evaluator's needs while remaining consistent with human judgment across a variety of criteria.

With these two components, the researchers were able to use the Elo system to rank text-to-3D models. Experimental results confirm that the team's proposed method can outperform existing metrics on a variety of criteria.
