MLMs / Janus: Translation and Commentary on "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling"
Overview: This paper presents Janus-Pro, an improved unified multimodal understanding and generation model. Through improvements on several fronts, Janus-Pro makes notable progress in unified multimodal understanding and generation, offering new ideas and directions for research in this area.
>> Background and pain points: limitations of existing unified multimodal models. Existing unified multimodal understanding and generation models typically process both tasks with the same visual encoder, which leads to suboptimal multimodal understanding performance because the two tasks need different image representations. Janus partially resolved this by decoupling visual encoding, but at the 1B parameter scale, limited training data and small model capacity left it weak on short-prompt image generation and unstable in text-to-image generation quality.
>> Solution: Janus-Pro addresses these shortcomings of Janus with improvements in three areas:
● Optimized training strategy: Janus's three-stage training pipeline is revised. Specifically: Stage I is trained longer to fully exploit ImageNet data for modeling pixel dependencies; Stage II trains directly on ordinary text-to-image data, improving training efficiency; and the ratio of supervised fine-tuning data in Stage III is adjusted to balance multimodal understanding and visual generation.
● Data scaling: the training data is expanded substantially. For multimodal understanding, roughly 90 million samples are added, covering image captioning, table, chart, and document understanding; for visual generation, roughly 72 million synthetic aesthetic samples are added, raising data quality and improving the stability and visual appeal of generated images.
● Model scaling: the model is scaled from 1.5B to 7B parameters, validating the scalability of the decoupled visual encoding approach.
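The three areas of improvement above can be captured in a small configuration sketch (illustrative only: the keys, stage descriptions, and structure are paraphrases of this summary, not the authors' actual configuration):

```python
# Hypothetical summary of the Janus-Pro improvements described above.
# All keys and values are illustrative paraphrases of this post's text.
JANUS_PRO_IMPROVEMENTS = {
    "training_strategy": {
        "stage_1": "train longer on ImageNet to model pixel dependencies",
        "stage_2": "train directly on ordinary text-to-image data",
        "stage_3": "rebalance supervised fine-tuning data ratios",
    },
    "data_scaling": {
        "understanding_samples_added": 90_000_000,  # captions, tables, charts, docs
        "generation_samples_added": 72_000_000,     # synthetic aesthetic data
    },
    "model_scaling": {
        "params_before": "1.5B",
        "params_after": "7B",
    },
}

total_added = sum(JANUS_PRO_IMPROVEMENTS["data_scaling"].values())
print(f"{total_added:,} new training samples")  # 162,000,000 new training samples
```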
>> Core idea and steps: the core idea of Janus-Pro is to decouple visual encoding, using separate encoders for the multimodal understanding and visual generation tasks. The steps are:
● Independent encoding: a SigLIP encoder extracts high-dimensional semantic features from images for the understanding task; a VQ tokenizer converts images into discrete IDs for the generation task.
● Feature mapping: an understanding adapter and a generation adapter map the image features into the LLM's input space.
● Multimodal fusion: the mapped feature sequences are concatenated with the text prompt into a single multimodal feature sequence.
● Unified processing: the multimodal feature sequence is fed into a unified autoregressive Transformer.
● Independent prediction heads: the visual generation task uses a randomly initialized prediction head to predict image tokens.
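The steps above can be sketched as a toy forward pass (pure-Python stand-ins with made-up shapes and codebook size; the real model uses a SigLIP encoder, a VQ tokenizer, learned adapters, and an LLM backbone, none of which appear here):

```python
# Toy sketch of the decoupled-encoding pipeline described above.
# Every function is a stand-in; shapes and values are invented for illustration.

def und_encoder(image):
    """Stand-in for SigLIP: image -> continuous semantic feature rows."""
    return [[float(px) / 255.0 for px in row] for row in image]

def gen_tokenizer(image):
    """Stand-in for the VQ tokenizer: image -> discrete token IDs."""
    return [px % 16 for row in image for px in row]  # toy 16-entry codebook

def adapter(features, dim=4):
    """Stand-in adapter: map features into the LLM input space (dim-vectors)."""
    out = []
    for f in features:
        v = f if isinstance(f, list) else [float(f)]
        out.append((v * dim)[:dim])  # pad/trim each item to the model dimension
    return out

image = [[0, 128], [255, 64]]          # 2x2 "image"
text_embeds = [[0.1] * 4, [0.2] * 4]   # embedded text prompt (2 tokens)

# Understanding path: continuous features -> adapter -> concat with text.
und_seq = text_embeds + adapter(und_encoder(image))
# Generation path: discrete IDs -> adapter -> concat with text.
gen_seq = text_embeds + adapter(gen_tokenizer(image))

# Both sequences would then be processed by the same autoregressive
# Transformer; generation additionally uses its own prediction head.
```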
>> Advantages:
● Improved multimodal understanding: Janus-Pro achieves state-of-the-art results on several multimodal understanding benchmarks, clearly outperforming Janus and other models, and stays competitive even against models with more parameters.
● Substantially better text-to-image generation: Janus-Pro posts significant gains on both GenEval and DPG-Bench, excels at instruction following, and produces images with higher quality, richer detail, and better stability.
● Scalability: the 7B Janus-Pro model validates the scalability of the approach, with the larger model converging faster.
>> Conclusions and takeaways:
● By refining the training strategy, scaling the data, and enlarging the model, Janus-Pro significantly improves both multimodal understanding and text-to-image generation.
● Decoupled visual encoding is the key to improving the performance of unified multimodal models.
● Despite this progress, Janus-Pro still has limitations: the 384x384 input resolution hurts fine-grained tasks, and the low image resolution leaves generated images short on detail. Raising the image resolution could address these issues in future work.
Translation and Commentary on "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling"
Paper | https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf
Date | January 27, 2025
Authors | DeepSeek team
Abstract
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
Figure 1 | Multimodal understanding and visual generation results from our Janus-Pro. For multimodal understanding, we average the accuracy of POPE, MME-Perception, GQA, and MMMU. The scores of MME-Perception are divided by 20 to scale to [0, 100]. For visual generation, we evaluate the performance on two instruction-following benchmarks, GenEval and DPG-Bench. Overall, Janus-Pro outperforms the previous state-of-the-art unified multimodal models as well as some task-specific models. Best viewed on screen.
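As a small check of the scoring recipe in the caption, the MME-Perception score (on a roughly 0–2000 scale) is divided by 20 before being averaged with the other accuracies. The numbers below are made up for illustration, not Janus-Pro's actual results:

```python
# Illustrative only: invented benchmark scores, not Janus-Pro's numbers.
pope, gqa, mmmu = 87.0, 62.0, 41.0   # already in [0, 100]
mme_perception = 1567.0              # MME-Perception is on a ~[0, 2000] scale

mme_scaled = mme_perception / 20     # -> 78.35, now comparable to the others
avg_understanding = (pope + gqa + mmmu + mme_scaled) / 4
print(round(avg_understanding, 2))   # 67.09
```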
1. Introduction
Recent advancements in unified multimodal understanding and generation models have demonstrated significant progress [30, 40, 45, 46, 48, 50, 54, 55]. These approaches have been proven to enhance the instruction-following capabilities in visual generation tasks while reducing model redundancy. Most of these methods utilize the same visual encoder to process inputs for both multimodal understanding and generation tasks. Since the representations required for these two tasks differ, this often results in suboptimal performance in multimodal understanding. To address this issue, Janus [46] proposes decoupling visual encoding, which alleviates the conflict between multimodal understanding and generation tasks, achieving excellent performance in both tasks. As a pioneering model, Janus is validated at the 1B parameter scale. However, due to the limited amount of training data and the relatively small model capacity, it exhibits certain shortcomings, such as suboptimal performance on short-prompt image generation and unstable text-to-image generation quality. In this paper, we introduce Janus-Pro, an enhanced version of Janus that incorporates improvements across three dimensions: training strategies, data, and model size. The Janus-Pro series includes two model sizes: 1B and 7B, demonstrating scalability of the visual encoding decoding method.
We evaluate Janus-Pro on multiple benchmarks, and the results reveal its superior multimodal understanding capabilities and significantly improved text-to-image instruction-following performance. Specifically, Janus-Pro-7B achieves a score of 79.2 on the multimodal understanding benchmark MMBench [29], surpassing state-of-the-art unified multimodal models such as Janus [46] (69.4), TokenFlow [34] (68.9), and MetaMorph [42] (75.2). Additionally, on the text-to-image instruction-following leaderboard GenEval [14], Janus-Pro-7B scores 0.80, outperforming Janus [46] (0.61), DALL-E 3 (0.67), and Stable Diffusion 3 Medium [11] (0.74).
Figure 2 | Comparison of text-to-image generation between Janus-Pro and its predecessor, Janus. Janus-Pro delivers more stable outputs for short prompts, with improved visual quality, richer details, and the ability to generate simple text. The image resolution is 384 × 384. Best viewed on screen.
Figure 3 | Architecture of our Janus-Pro. We decouple visual encoding for multimodal understanding and visual generation. "Und. Encoder" and "Gen. Encoder" are abbreviations for "Understanding Encoder" and "Generation Encoder", respectively. Best viewed on screen.
Conclusion
This paper introduces improvements to Janus from three aspects: training strategy, data, and model size. These enhancements have led to significant advancements in both multimodal understanding and text-to-image instruction-following capabilities. However, Janus-Pro still has certain limitations. In terms of multimodal understanding, the input resolution is limited to 384 × 384, which affects its performance in fine-grained tasks such as OCR. For text-to-image generation, the low resolution, combined with reconstruction losses introduced by the vision tokenizer, results in images that, while rich in semantic content, still lack fine details. For example, small facial regions occupying limited image space may appear under-detailed. Increasing the image resolution could mitigate these issues.