[Original] LLMs / MoE / DeepSeek-V3: Introduction, Installation, Usage, and Example Applications of DeepSeek-V3 (A Detailed Guide)

處女座的程序猿, published 2024-12-27, Shanghai


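Introduction to DeepSeek-V3

DeepSeek-V3 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture, both of which were thoroughly validated in DeepSeek-V2. It is pre-trained on 14.8 trillion diverse, high-quality tokens, followed by supervised fine-tuning and reinforcement learning to fully harness its capabilities. Training was very stable, with no irrecoverable loss spikes and no rollbacks. The full training run required only 2.788M H800 GPU hours.

In short, DeepSeek-V3 is a high-performing, efficiently trained and easily deployable open-source large language model. It shows strong capability across many domains and supports multiple hardware platforms and inference frameworks.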
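Some quick arithmetic helps put the headline figures above in perspective. The sketch below computes the share of parameters used per token and converts the quoted GPU hours into wall-clock days. The cluster size of 2048 GPUs is an assumption made purely for illustration and is not stated in this article; the function names are likewise hypothetical.

```python
# Figures quoted in the introduction above.
TOTAL_PARAMS_B = 671   # total parameters, in billions
ACTIVE_PARAMS_B = 37   # parameters activated per token, in billions
GPU_HOURS_M = 2.788    # full training cost, in millions of H800 GPU hours

# ASSUMPTION (not from the article): a hypothetical cluster size, used only
# to turn GPU hours into an illustrative wall-clock figure.
ASSUMED_GPUS = 2048


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters that a single token actually uses."""
    return active_b / total_b


def wall_clock_days(gpu_hours_m: float, n_gpus: int) -> float:
    """Days of continuous training if the GPU hours were spread over n_gpus."""
    return gpu_hours_m * 1e6 / n_gpus / 24


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    days = wall_clock_days(GPU_HOURS_M, ASSUMED_GPUS)
    print(f"active share per token: {share:.1%}")
    print(f"wall clock on {ASSUMED_GPUS} GPUs: {days:.1f} days")
```

Under these assumptions each token touches roughly 5.5% of the weights, and the whole run would fit in about 57 days on the assumed cluster.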

GitHub repository: https://github.com/deepseek-ai/DeepSeek-V3

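1. Features of DeepSeek-V3

>> Efficient MoE architecture: uses Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture for efficient inference and cost-effective training.

>> Innovative load-balancing strategy: adopts an auxiliary-loss-free load-balancing strategy, which minimizes the performance degradation that normally comes from encouraging load balance.

>> Multi-Token Prediction (MTP) objective: training with a multi-token prediction objective improves model performance and can also be used for speculative decoding to accelerate inference.

>> FP8 mixed-precision training: validates, for the first time, the feasibility and effectiveness of FP8 training on a very large-scale model, which markedly improves training efficiency and lowers training cost.

>> Inference optimization: supports FP8 and BF16 inference and is integrated with several open-source inference frameworks, such as DeepSeek-Infer Demo, SGLang, LMDeploy and TensorRT-LLM. It runs on NVIDIA and AMD GPUs as well as Huawei Ascend NPUs.

>> Knowledge distillation: reasoning capability is distilled from the DeepSeek-R1 series of models, improving DeepSeek-V3's reasoning performance while keeping output style and length under control.

>> Strong performance: outperforms other open-source models on a wide range of benchmarks and is comparable to leading closed-source models.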
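The feature list mentions an auxiliary-loss-free load-balancing strategy without spelling out the mechanism. One way such a scheme can work, sketched here in plain Python as a toy and not as DeepSeek's code, is to keep a per-expert bias that only influences which experts are selected, and to nudge that bias after every batch. The expert count, step size, score distribution and all function names below are illustrative assumptions.

```python
import random


def route_top_k(scores, bias, k):
    """Select k experts by biased score; weight them by the unbiased scores."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, load, step):
    """Nudge overloaded experts down and underloaded experts up by a fixed step."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


def simulate(n_experts=8, k=2, tokens_per_batch=512, n_batches=200, step=0.01, seed=0):
    """Return per-batch expert loads for a toy router whose expert 0 is favoured."""
    rng = random.Random(seed)
    bias = [0.0] * n_experts
    history = []
    for _ in range(n_batches):
        load = [0] * n_experts
        for _ in range(tokens_per_batch):
            # ASSUMPTION: toy affinity scores, positive, with expert 0 boosted
            # artificially so that the unbalanced starting point is visible.
            scores = [rng.uniform(0.01, 1.0) + (0.5 if e == 0 else 0.0)
                      for e in range(n_experts)]
            for e in route_top_k(scores, bias, k):
                load[e] += 1
        history.append(load)
        bias = update_bias(bias, load, step)
    return history


if __name__ == "__main__":
    history = simulate()
    print("first batch load:", history[0])
    print("last batch load: ", history[-1])
```

Because the bias enters only the selection step, the gate weights are still computed from the raw scores, so no extra loss term is needed to push the router toward balance. In this toy run the favoured expert starts heavily overloaded and drifts toward the mean load as its bias falls.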

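2. Model Performance

Comprehensive evaluations show that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite this, the full training run required only 2.788M H800 GPU hours. Training was also very stable: the DeepSeek team reports no irrecoverable loss spikes and no rollbacks during the entire run.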

Base Model

Standard Benchmarks

	Benchmark (Metric)	# Shots	DeepSeek-V2	Qwen2.5 72B	LLaMA3.1 405B	DeepSeek-V3
	Architecture	-	MoE	Dense	Dense	MoE
	# Activated Params	-	21B	72B	405B	37B
	# Total Params	-	236B	72B	405B	671B
English	Pile-test (BPB)	-	0.606	0.638	0.542	0.548
	BBH (EM)	3-shot	78.8	79.8	82.9	87.5
	MMLU (Acc.)	5-shot	78.4	85.0	84.4	87.1
	MMLU-Redux (Acc.)	5-shot	75.6	83.2	81.3	86.2
	MMLU-Pro (Acc.)	5-shot	51.4	58.3	52.8	64.4
	DROP (F1)	3-shot	80.4	80.6	86.0	89.0
	ARC-Easy (Acc.)	25-shot	97.6	98.4	98.4	98.9
	ARC-Challenge (Acc.)	25-shot	92.2	94.5	95.3	95.3
	HellaSwag (Acc.)	10-shot	87.1	84.8	89.2	88.9
	PIQA (Acc.)	0-shot	83.9	82.6	85.9	84.7
	WinoGrande (Acc.)	5-shot	86.3	82.3	85.2	84.9
	RACE-Middle (Acc.)	5-shot	73.1	68.1	74.2	67.1
	RACE-High (Acc.)	5-shot	52.6	50.3	56.8	51.3
	TriviaQA (EM)	5-shot	80.0	71.9	82.7	82.9
	NaturalQuestions (EM)	5-shot	38.6	33.2	41.5	40.0
	AGIEval (Acc.)	0-shot	57.5	75.8	60.6	79.6
Code	HumanEval (Pass@1)	0-shot	43.3	53.0	54.9	65.2
	MBPP (Pass@1)	3-shot	65.0	72.6	68.4	75.4
	LiveCodeBench-Base (Pass@1)	3-shot	11.6	12.9	15.5	19.4
	CRUXEval-I (Acc.)	2-shot	52.5	59.1	58.5	67.3
	CRUXEval-O (Acc.)	2-shot	49.8	59.9	59.9	69.8
Math	GSM8K (EM)	8-shot	81.6	88.3	83.5	89.3
	MATH (EM)	4-shot	43.4	54.4	49.0	61.6
	MGSM (EM)	8-shot	63.6	76.2	69.9	79.8
	CMath (EM)	3-shot	78.7	84.5	77.3	90.7
Chinese	CLUEWSC (EM)	5-shot	82.0	82.5	83.0	82.7
	C-Eval (Acc.)	5-shot	81.4	89.2	72.5	90.1
	CMMLU (Acc.)	5-shot	84.0	89.5	73.7	88.8
	CMRC (EM)	1-shot	77.4	75.8	76.0	76.3
	C3 (Acc.)	0-shot	77.4	76.7	79.7	78.6
	CCPM (Acc.)	0-shot	93.0	88.5	78.6	92.0
Multilingual	MMMLU-non-English (Acc.)	5-shot	64.0	74.8	73.8	79.4

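Note: In the original table the best results are shown in bold. Scores whose gap does not exceed 0.3 are considered to be at the same level. DeepSeek-V3 achieves the best result on most benchmarks, especially on math and code tasks. For more evaluation details, see the DeepSeek-V3 paper.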
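The 0.3 rule in the note can be applied mechanically to any row of the table. The helper below is one possible reading of that rule (every model within 0.3 of the best score counts as top tier) and assumes a higher-is-better metric, so it does not apply to the Pile-test BPB row. The function name is hypothetical; the two rows are copied from the base-model table above.

```python
def top_tier(scores, tolerance=0.3):
    """Return the models whose score is within `tolerance` of the best score.

    Scores are compared in integer tenths to avoid floating-point surprises
    (the table reports one decimal place). Higher is assumed to be better.
    """
    tenths = {model: round(value * 10) for model, value in scores.items()}
    best = max(tenths.values())
    limit = round(tolerance * 10)
    return [model for model, t in tenths.items() if best - t <= limit]


# Two rows copied from the base-model table above.
HELLASWAG = {"DeepSeek-V2": 87.1, "Qwen2.5 72B": 84.8,
             "LLaMA3.1 405B": 89.2, "DeepSeek-V3": 88.9}
MMLU = {"DeepSeek-V2": 78.4, "Qwen2.5 72B": 85.0,
        "LLaMA3.1 405B": 84.4, "DeepSeek-V3": 87.1}

if __name__ == "__main__":
    print(top_tier(HELLASWAG))
    print(top_tier(MMLU))
```

On HellaSwag, LLaMA3.1 405B (89.2) and DeepSeek-V3 (88.9) differ by exactly 0.3 and so share the top tier under this reading, whereas on MMLU DeepSeek-V3 stands alone.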

Context Window

Evaluation results on the "Needle In A Haystack" (NIAH) test. DeepSeek-V3 performs well across all context window lengths up to 128K.
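For readers unfamiliar with NIAH, the idea is to hide a short fact (the needle) at a chosen depth inside a long filler text (the haystack) and ask the model to retrieve it. The sketch below only builds such prompts; it measures length in characters instead of tokens, and the filler, needle, question and function name are all made up for illustration and are not the setup used in the DeepSeek evaluation.

```python
def build_niah_prompt(filler, needle, depth, target_chars):
    """Bury `needle` inside repeated `filler` text at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    repeats = target_chars // len(filler) + 1
    haystack = (filler * repeats)[:target_chars]
    cut = int(len(haystack) * depth)
    return haystack[:cut] + " " + needle + " " + haystack[cut:]


if __name__ == "__main__":
    # Hypothetical filler, needle and question, chosen only for illustration.
    filler = "The quick brown fox jumps over the lazy dog. "
    needle = "The secret passphrase is 'blue-giraffe-42'."
    question = "What is the secret passphrase?"
    for depth in (0.0, 0.5, 1.0):
        prompt = build_niah_prompt(filler, needle, depth, target_chars=2000)
        # A real test would send prompt + question to the model and check
        # whether the answer contains the passphrase.
        print(depth, len(prompt), prompt.find(needle))
```

Sweeping both the depth and the target length gives the grid of cells that an NIAH heat map typically reports.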

Chat Model

Standard Benchmarks (models larger than 67B parameters)

	Benchmark (Metric)	DeepSeek V2-0506	DeepSeek V2.5-0905	Qwen2.5 72B-Inst.	Llama3.1 405B-Inst.	Claude-3.5-Sonnet-1022	GPT-4o 0513	DeepSeek V3
	Architecture	MoE	MoE	Dense	Dense	-	-	MoE
	# Activated Params	21B	21B	72B	405B	-	-	37B
	# Total Params	236B	236B	72B	405B	-	-	671B
English	MMLU (EM)	78.2	80.6	85.3	88.6	88.3	87.2	88.5
	MMLU-Redux (EM)	77.9	80.3	85.6	86.2	88.9	88.0	89.1
	MMLU-Pro (EM)	58.5	66.2	71.6	73.3	78.0	72.6	75.9
	DROP (3-shot F1)	83.0	87.8	76.7	88.7	88.3	83.7	91.6
	IF-Eval (Prompt Strict)	57.7	80.6	84.1	86.0	86.5	84.3	86.1
	GPQA-Diamond (Pass@1)	35.3	41.3	49.0	51.1	65.0	49.9	59.1
	SimpleQA (Correct)	9.0	10.2	9.1	17.1	28.4	38.2	24.9
	FRAMES (Acc.)	66.9	65.4	69.8	70.0	72.5	80.5	73.3
	LongBench v2 (Acc.)	31.6	35.4	39.4	36.1	41.0	48.1	48.7
Code	HumanEval-Mul (Pass@1)	69.3	77.4	77.3	77.2	81.7	80.5	82.6
	LiveCodeBench (Pass@1-COT)	18.8	29.2	31.1	28.4	36.3	33.4	40.5
	LiveCodeBench (Pass@1)	20.3	28.4	28.7	30.1	32.8	34.2	37.6
	Codeforces (Percentile)	17.5	35.6	24.8	25.3	20.3	23.6	51.6
	SWE Verified (Resolved)	-	22.6	23.8	24.5	50.8	38.8	42.0
	Aider-Edit (Acc.)	60.3	71.6	65.4	63.9	84.2	72.9	79.7
	Aider-Polyglot (Acc.)	-	18.2	7.6	5.8	45.3	16.0	49.6
Math	AIME 2024 (Pass@1)	4.6	16.7	23.3	23.3	16.0	9.3	39.2
	MATH-500 (EM)	56.3	74.7	80.0	73.8	78.3	74.6	90.2
	CNMO 2024 (Pass@1)	2.8	10.8	15.9	6.8	13.1	10.8	43.2
Chinese	CLUEWSC (EM)	89.9	90.4	91.4	84.7	85.4	87.9	90.9
	C-Eval (EM)	78.6	79.5	86.1	61.5	76.7	76.0	86.5
	C-SimpleQA (Correct)	48.5	54.1	48.4	50.4	51.3	59.3	64.8

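Note: All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times with varying temperature settings to derive robust final results. DeepSeek-V3 is the best-performing open-source model and is also competitive with frontier closed-source models.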
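The note says that small benchmarks are run several times at different temperatures, but not how the runs are combined. A minimal sketch, assuming the runs are simply averaged and that large benchmarks are scored once at the first listed temperature, might look as follows. The `evaluate` callback, the temperature values and the scores in the demo are all hypothetical.

```python
from statistics import mean


def robust_score(evaluate, n_samples, temperatures, threshold=1000):
    """Score a benchmark once if it is large, or average several runs if small.

    `evaluate(temperature)` is a caller-supplied function returning a score.
    ASSUMPTION: averaging is one plausible way to combine runs; the article
    does not say which aggregation was actually used.
    """
    if n_samples >= threshold:
        return evaluate(temperatures[0])
    return mean(evaluate(t) for t in temperatures)


if __name__ == "__main__":
    # Made-up scores standing in for real model runs.
    fake_runs = {0.2: 40.0, 0.6: 38.0, 1.0: 39.0}
    print(robust_score(fake_runs.get, n_samples=30, temperatures=[0.2, 0.6, 1.0]))
```

Averaging over temperatures reduces the variance that a single sampled run would show on a benchmark with only a few dozen problems.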

Open Ended Generation Evaluation

Model	Arena-Hard	AlpacaEval 2.0
DeepSeek-V2.5-0905	76.2	50.5
Qwen2.5-72B-Instruct	81.2	49.1
LLaMA-3.1 405B	69.3	40.5
GPT-4o-0513	80.4	51.1
Claude-Sonnet-3.5-1022	85.2	52.0
DeepSeek-V3	85.5	70.0

Note: English open-ended conversation evaluations. For AlpacaEval 2.0, the length-controlled win rate is used as the metric.

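Installing and Using DeepSeek-V3

DeepSeek-V3 can be run locally in several ways, but HuggingFace's Transformers does not yet support it directly.

1. Installation

Clone the repository and install the dependencies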

git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
pip install -r requirements.txt

Download the model weights

Download the model weights from HuggingFace and place them in the /path/to/DeepSeek-V3 folder.

Model	#Total Params	#Activated Params	Context Length	Download
DeepSeek-V3-Base	671B	37B	128K	🤗 HuggingFace
DeepSeek-V3	671B	37B	128K	🤗 HuggingFace
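One way to script the download step is with the `huggingface_hub` package, whose `snapshot_download` function fetches a whole repository into a local directory. The sketch below assumes the package is installed and that the repository ids follow the `deepseek-ai/<model name>` pattern, mirroring the GitHub organisation; the two helper names are hypothetical.

```python
def download_kwargs(model_name, target_dir):
    """Build the arguments for huggingface_hub.snapshot_download."""
    # ASSUMPTION: repository ids follow the pattern deepseek-ai/<model name>.
    return {"repo_id": f"deepseek-ai/{model_name}", "local_dir": target_dir}


def download_weights(model_name, target_dir):
    """Fetch every file of the model repository into target_dir."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(**download_kwargs(model_name, target_dir))


# Example call (not executed here because the checkpoint is very large):
# download_weights("DeepSeek-V3", "/path/to/DeepSeek-V3")
```

Passing the same target directory that the conversion step expects keeps the later commands unchanged.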

Model weight conversion (DeepSeek-Infer Demo example)

python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16
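The two numeric flags in the conversion command are related: with 256 experts and a model-parallel degree of 16, an even split gives each rank 16 experts. The sketch below illustrates that arithmetic, assuming experts are sharded evenly and contiguously across ranks. It is not the repository's `convert.py`, and the function names are hypothetical.

```python
def experts_per_rank(n_experts, model_parallel):
    """Number of experts each model-parallel rank holds under an even split."""
    if n_experts % model_parallel != 0:
        raise ValueError("n_experts must be divisible by model_parallel")
    return n_experts // model_parallel


def expert_range(rank, n_experts, model_parallel):
    """Expert indices owned by `rank`, ASSUMING contiguous sharding."""
    per_rank = experts_per_rank(n_experts, model_parallel)
    return range(rank * per_rank, (rank + 1) * per_rank)


if __name__ == "__main__":
    print(experts_per_rank(256, 16))
    print(list(expert_range(1, 256, 16)))
```

A model-parallel value that does not divide the expert count would leave some ranks with a different number of experts, which is why the helper rejects it.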

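2. Model Inference

DeepSeek-V3 can be deployed locally using the following hardware and open-source community software:

>> DeepSeek-Infer Demo: a simple, lightweight demo for FP8 and BF16 inference.

>> SGLang: fully supports the DeepSeek-V3 model in both BF16 and FP8 inference modes.

>> LMDeploy: efficient FP8 and BF16 inference for local and cloud deployment.

>> TensorRT-LLM: currently supports BF16 inference and INT4/8 quantization, with FP8 support coming soon.

>> AMD GPU: DeepSeek-V3 can be run on AMD GPUs via SGLang in both BF16 and FP8 modes.

>> Huawei Ascend NPU: DeepSeek-V3 can be run on Huawei Ascend devices.

In addition to the two DeepSeek-Infer Demo modes shown below (interactive and batch), the project recommends frameworks such as SGLang, LMDeploy and TensorRT-LLM for inference and provides links and instructions for each. SGLang in particular supports AMD GPUs, and Huawei Ascend NPU support is provided through the MindIE framework. If BF16 weights are needed, the conversion script provided in the repository can be used.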
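The support list above can be captured as a small lookup table, which makes it easy to ask which frameworks cover a given precision. The entries below restate only what this article lists; a value of `None` means the article does not state a precision for that entry. The structure and function name are illustrative.

```python
# Precision support as listed in this article; None means "not stated here".
SUPPORT = {
    "DeepSeek-Infer Demo": {"FP8", "BF16"},
    "SGLang": {"FP8", "BF16"},               # also listed for AMD GPUs
    "LMDeploy": {"FP8", "BF16"},
    "TensorRT-LLM": {"BF16", "INT4", "INT8"},  # FP8 listed as coming soon
    "MindIE (Huawei Ascend)": None,
}


def frameworks_supporting(precision):
    """Names of the frameworks that list the given precision, sorted by name."""
    return sorted(name for name, modes in SUPPORT.items()
                  if modes and precision in modes)


if __name__ == "__main__":
    print("FP8: ", frameworks_supporting("FP8"))
    print("BF16:", frameworks_supporting("BF16"))
```

Since framework support changes quickly, a table like this should be refreshed from each project's own documentation before it is relied on.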

Run inference (DeepSeek-Infer Demo example, interactive):

torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200

Run inference (DeepSeek-Infer Demo example, batch):

torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --input-file $FILE
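When launching on two nodes, each node needs the same command with a different node rank. The helper below is a sketch that generates those command lines and checks that the total number of processes (2 nodes x 8 processes = 16) matches the `--model-parallel 16` used during weight conversion. It places the launcher options `--node-rank` and `--master-addr` before the script name, because torchrun forwards everything after the script name to the script itself. The master address in the demo is a placeholder and the function names are hypothetical.

```python
import shlex


def world_size(nnodes, nproc_per_node):
    """Total number of processes launched across all nodes."""
    return nnodes * nproc_per_node


def torchrun_command(node_rank, master_addr, nnodes=2, nproc_per_node=8, extra_args=()):
    """Build the torchrun command line for one node as a shell-safe string."""
    args = [
        "torchrun",
        "--nnodes", str(nnodes),
        "--nproc-per-node", str(nproc_per_node),
        "--node-rank", str(node_rank),
        "--master-addr", master_addr,
        "generate.py",
        "--ckpt-path", "/path/to/DeepSeek-V3-Demo",
        "--config", "configs/config_671B.json",
        *extra_args,
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # ASSUMPTION: the launch size should equal the --model-parallel value (16)
    # that was passed to convert.py.
    assert world_size(2, 8) == 16
    for rank in range(2):
        # 10.0.0.1 is a placeholder address for the rank-0 node.
        print(torchrun_command(rank, "10.0.0.1", extra_args=["--interactive"]))
```

Generating the commands from one function keeps the two nodes in sync when flags such as the temperature or the input file change.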