
Multimodal CoT Prompting


Zhang et al. (2023) proposed a multimodal chain-of-thought prompting approach. Traditional CoT focuses on the language modality; Multimodal CoT instead incorporates both text and vision in a two-stage framework. The first stage generates a rationale from the multimodal information. The second stage, answer inference, leverages the generated rationale to produce the final answer.

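The two-stage flow can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_vlm` is a hypothetical placeholder for any vision-language model client, shown only to make the prompt structure of each stage concrete.

```python
from typing import Optional, Tuple


def call_vlm(text_prompt: str, image: Optional[bytes] = None) -> str:
    """Hypothetical stand-in for a vision-language model API call.

    A real implementation would send the prompt (and image) to a model;
    here it returns a tagged echo so the two-stage flow is runnable.
    """
    return f"<model output for prompt of {len(text_prompt)} chars>"


def multimodal_cot(question: str, context: str,
                   image: Optional[bytes] = None) -> Tuple[str, str]:
    # Stage 1: rationale generation -- condition on text and vision together.
    rationale = call_vlm(f"{context}\nQuestion: {question}\nRationale:", image)

    # Stage 2: answer inference -- append the generated rationale to the
    # original input and ask only for the final answer.
    answer = call_vlm(
        f"{context}\nQuestion: {question}\nRationale: {rationale}\nAnswer:",
        image,
    )
    return rationale, answer
```

The key design point is that the rationale is generated as a separate output conditioned on both modalities, then fed back in as additional text context for answer inference.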

The multimodal CoT model (1B) outperforms GPT-3.5 on the ScienceQA benchmark.


MCOT

Image Source: Zhang et al. (2023)

Further reading:
