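Background
This article follows the previous one (principles and performance overhead) and walks through the actual PTI code. The code is fairly simple; it is recorded here for readers and for my own later reference.
Code analysis
The PTI code is relatively self-contained. It looks tightly tied to the kernel's core mm infrastructure, but it changes very little of the core mm code and the coupling is low, which is part of the elegance of the design.
In the kernel master branch, two patches are mainly involved: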
aa8c6248f8c75acfd610fe15d8cae23cf70d9d09:
commit aa8c6248f8c75acfd610fe15d8cae23cf70d9d09
Author: Thomas Gleixner <[email protected]>
Date: Mon Dec 4 15:07:36 2017 +0100
x86/mm/pti: Add infrastructure for page table isolation
Add the initial files for kernel page table isolation, with a minimal init
function and the boot time detection for this misfeature.
8a09317b895f073977346779df52f67c1056d81d:
commit 8a09317b895f073977346779df52f67c1056d81d
Author: Dave Hansen <[email protected]>
Date: Mon Dec 4 15:07:35 2017 +0100
x86/mm/pti: Prepare the x86/entry assembly code for entry/exit CR3 switching
PAGE_TABLE_ISOLATION needs to switch to a different CR3 value when it
enters the kernel and switch back when it exits. This essentially needs to
be done before leaving assembly code.
This is extra challenging because the switching context is tricky: the
registers that can be clobbered can vary. It is also hard to store things
on the stack because there is an established ABI (ptregs) or the stack is
entirely unsafe to use.
Establish a set of macros that allow changing to the user and kernel CR3
values.
Interactions with SWAPGS:
Previous versions of the PAGE_TABLE_ISOLATION code relied on having
per-CPU scratch space to save/restore a register that can be used for the
CR3 MOV. The %GS register is used to index into our per-CPU space, so
SWAPGS *had* to be done before the CR3 switch. That scratch space is gone
now, but the semantic that SWAPGS must be done before the CR3 MOV is
retained. This is good to keep because it is not that hard to do and it
allows to do things like add per-CPU debugging information.
What this does in the NMI code is worth pointing out. NMIs can interrupt
*any* context and they can also be nested with NMIs interrupting other
NMIs. The comments below ".Lnmi_from_kernel" explain the format of the
stack during this situation. Changing the format of this stack is hard.
Instead of storing the old CR3 value on the stack, this depends on the
*regular* register save/restore mechanism and then uses %r14 to keep CR3
during the NMI. It is callee-saved and will not be clobbered by the C NMI
handlers that get called.
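The commit logs describe the changes clearly and are worth reading carefully.
To see the details of the two patches, run the following two commands directly in the git repository: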
#git log -p aa8c6248f8c75acfd610fe15d8cae23cf70d9d09
#git log -p 8a09317b895f073977346779df52f67c1056d81d
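Overall, the two patches mainly modify the following files:
- Documentation/admin-guide/kernel-parameters.txt // adds the description of the boot parameter
- arch/x86/entry/calling.h // assembly macros for page table switching
- arch/x86/entry/entry_64.S & entry_64_compat.S // adds page table switching on user/kernel transitions
- arch/x86/include/asm/pti.h // new file: PTI interfaces and macro definitions
- arch/x86/mm/init.c // checks the PTI-related boot parameters during kernel initialization
- arch/x86/mm/pti.c // new file: PTI initialization, parameter checking, and other basic functions
- init/main.c // initializes PTI
The following sections go through the changes file by file.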
kernel-parameters.txt
The change to this file is as follows:
+ nopti [X86-64] Disable kernel page table isolation
+
This is trivial: it only adds the description of the nopti boot parameter, which disables PTI.
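For readers who want to see what a boolean boot option check amounts to, here is a minimal user-space sketch in C. It is not the kernel's cmdline_find_option_bool(); the helper name cmdline_has_option is hypothetical, and the sketch assumes options are separated by single spaces.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical, simplified stand-in for a boolean boot option lookup.
 * Returns true only when `option` appears as a whole space-delimited
 * word in `cmdline`, so "nopti" matches but "noptix" does not.
 */
static inline bool cmdline_has_option(const char *cmdline, const char *option)
{
	size_t len = strlen(option);
	const char *p = cmdline;

	if (len == 0)
		return false;

	while ((p = strstr(p, option)) != NULL) {
		bool starts_word = (p == cmdline) || (p[-1] == ' ');
		bool ends_word = (p[len] == '\0') || (p[len] == ' ');

		if (starts_word && ends_word)
			return true;
		p++; /* keep scanning after a partial match */
	}
	return false;
}
```

Calling cmdline_has_option("ro quiet nopti", "nopti") returns true, while a command line that only contains "noptix" does not match.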
calling.h
Both patches modify this file. They mainly add the assembly macros for page table switching, which are invoked on user/kernel transitions.
The first patch makes the following changes:
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 3fd8bc5..a9d17a7 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -1,6 +1,8 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/jump_label.h>
#include <asm/unwind_hints.h>
+#include <asm/cpufeatures.h>
+#include <asm/page_types.h>
/*
@@ -187,6 +189,70 @@ For 32-bit we have the following conventions - kernel is built with
#endif
.endm
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
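* Bit 12 of CR3 (the 13th bit) switches between the kernel and user page tables: when the bit is clear,
* the kernel page table is used; when it is set, the user page table is used. In other words, the kernel
* and user PGDs are two adjacent pages.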
*/
+/* PAGE_TABLE_ISOLATION PGDs are 8k. Flip bit 12 to switch between the two halves: */
+#define PTI_SWITCH_MASK (1<<PAGE_SHIFT)
+/* Adjust the CR3 value on kernel entry: clear bit 12 so that CR3 points at the kernel page table */
+.macro ADJUST_KERNEL_CR3 reg:req
+ /* Clear "PAGE_TABLE_ISOLATION bit", point CR3 at kernel pagetables: */
+ andq $(~PTI_SWITCH_MASK), \reg
+.endm
+/* Same idea as above: set bit 12 so that CR3 points at the user page table */
+.macro ADJUST_USER_CR3 reg:req
+ /* Move CR3 up a page to the user page tables: */
+ orq $(PTI_SWITCH_MASK), \reg
+.endm
+/* Switch to the kernel CR3 on kernel entry: apply ADJUST_KERNEL_CR3, then write the result back to CR3 to switch page tables */
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+ mov %cr3, \scratch_reg
+ ADJUST_KERNEL_CR3 \scratch_reg
+ mov \scratch_reg, %cr3
+.endm
+/* Same idea as above: switch to the user CR3 */
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+ mov %cr3, \scratch_reg
+ ADJUST_USER_CR3 \scratch_reg
+ mov \scratch_reg, %cr3
+.endm
+/* Like SWITCH_TO_KERNEL_CR3, except that the current CR3 is saved first */
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+ movq %cr3, \scratch_reg
+ movq \scratch_reg, \save_reg
+ /*
+ * Is the switch bit zero? This means the address is
+ * up in real PAGE_TABLE_ISOLATION patches in a moment.
+ */
+ testq $(PTI_SWITCH_MASK), \scratch_reg
+ jz .Ldone_\@
+
+ ADJUST_KERNEL_CR3 \scratch_reg
+ movq \scratch_reg, %cr3
+
+.Ldone_\@:
+.endm
+/* Restore CR3 from the previously saved value */
+.macro RESTORE_CR3 save_reg:req
+ /*
+ * The CR3 write could be avoided when not changing its value,
+ * but would require a CR3 read *and* a scratch register.
+ */
+ movq \save_reg, %cr3
+.endm
+
+#else /* CONFIG_PAGE_TABLE_ISOLATION=n: */
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+.endm
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+.endm
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+.endm
+.macro RESTORE_CR3 save_reg:req
+.endm
+
+#endif
+
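The bit manipulation performed by ADJUST_KERNEL_CR3 and ADJUST_USER_CR3 can be modeled in plain C. The following is a minimal user-space sketch, not kernel code: it treats CR3 as an ordinary 64-bit value, assumes PAGE_SHIFT is 12 (4 KiB pages), and uses hypothetical function names.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, so PAGE_SHIFT is 12. */
#define PAGE_SHIFT 12
#define PTI_SWITCH_MASK (UINT64_C(1) << PAGE_SHIFT)

/* Models ADJUST_KERNEL_CR3: clear bit 12 to select the kernel half of the 8k PGD. */
static inline uint64_t adjust_kernel_cr3(uint64_t cr3)
{
	return cr3 & ~PTI_SWITCH_MASK;
}

/* Models ADJUST_USER_CR3: set bit 12 to select the user half, one page higher. */
static inline uint64_t adjust_user_cr3(uint64_t cr3)
{
	return cr3 | PTI_SWITCH_MASK;
}
```

With an 8k-aligned PGD at 0x1000000, for example, the user half sits at 0x1001000, and applying adjust_kernel_cr3 to either value yields 0x1000000 again. Because only one bit differs, no lookup is needed to find the other page table.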
The second patch refines this file. It mainly adds a switch (X86_FEATURE_PTI) that controls whether the PTI code paths are taken.
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index a9d17a7..3d3389a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -205,18 +205,23 @@ For 32-bit we have the following conventions - kernel is built with
.endm
.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+ ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
mov %cr3, \scratch_reg
ADJUST_KERNEL_CR3 \scratch_reg
mov \scratch_reg, %cr3
+.Lend_\@:
.endm
.macro SWITCH_TO_USER_CR3 scratch_reg:req
+ ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
mov %cr3, \scratch_reg
ADJUST_USER_CR3 \scratch_reg
mov \scratch_reg, %cr3
+.Lend_\@:
.endm
.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+ ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
movq %cr3, \scratch_reg
movq \scratch_reg, \save_reg
/*
@@ -233,11 +238,13 @@ For 32-bit we have the following conventions - kernel is built with
.endm
.macro RESTORE_CR3 save_reg:req
+ ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
/*
 * The CR3 write could be avoided when not changing its value,
 * but would require a CR3 read *and* a scratch register.
 */
movq \save_reg, %cr3
+.Lend_\@:
.endm
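The change is straightforward: each macro now begins with an ALTERNATIVE that jumps past the CR3 switch unless X86_FEATURE_PTI is set.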
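The effect of gating the macros on X86_FEATURE_PTI can be sketched in C as follows. This is only a model under stated assumptions: the kernel's ALTERNATIVE patches the instruction stream once at boot, whereas this sketch uses an ordinary runtime boolean, and the function name and the pti_enabled parameter are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the switch bit is bit 12, as in the first patch. */
#define PTI_SWITCH_MASK (UINT64_C(1) << 12)

/*
 * Models SWITCH_TO_USER_CR3 after the second patch.
 * pti_enabled stands in for the X86_FEATURE_PTI feature bit.
 */
static inline uint64_t switch_to_user_cr3(uint64_t cr3, bool pti_enabled)
{
	if (!pti_enabled)
		return cr3;             /* the "jmp .Lend_\@" path: CR3 untouched */

	return cr3 | PTI_SWITCH_MASK;   /* the patched path: select the user PGD */
}
```

When the feature is off, the value passes through unchanged, which matches the intent that a kernel built with PTI support pays almost nothing on CPUs that do not need it.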
entry_64.S
This part of the code adds the page table switching operations on transitions between user mode and kernel mode.
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 87cebe7..2ad7ad4 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -164,6 +164,9 @@ ENTRY(entry_SYSCALL_64_trampoline)
/* Stash the user RSP. */
movq %rsp, RSP_SCRATCH
/*
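* At the system call entry, add a page table switch to the kernel page table; the switch itself is defined in calling.h.
* Note that this is the trampoline path, not the regular system call entry path. With PTI enabled, the regular
* path is no longer taken, in order to reduce the kernel addresses that stay mapped; see my previous article on the PTI principles.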
*/
+ /* Note: using %rsp as a scratch reg. */
+ SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
/* Load the top of the task stack into RSP */
movq CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
@@ -203,6 +206,10 @@ ENTRY(entry_SYSCALL_64)
*/
swapgs
/* With PTI enabled, this regular path is not taken, in order to reduce the kernel addresses that stay mapped; see my previous article on the PTI principles */
+ /*
+ * This path is not taken when PAGE_TABLE_ISOLATION is disabled so it
+ * is not required to switch CR3.
+ */
movq %rsp, PER_CPU_VAR(rsp_scratch)
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
@@ -399,6 +406,7 @@ syscall_return_via_sysret:
* We are on the trampoline stack. All regs except RDI are live.
* We can do future final exit work right here.
*/
/* System call return: switch to the user page table here */
+ SWITCH_TO_USER_CR3 scratch_reg=%rdi
popq %rdi
popq %rsp
@@ -736,6 +744,8 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
* We can do future final exit work right here.
*/
/* Interrupt return to user mode: switch to the user page table */
+ SWITCH_TO_USER_CR3 scratch_reg=%rdi
+
/* Restore RDI. */
popq %rdi
SWAPGS
@@ -818,7 +828,9 @@ native_irq_return_ldt:
*/
pushq %rdi /* Stash user RDI */
- SWAPGS
/* Interrupt return path (native_irq_return_ldt): switch to the kernel GS and the kernel page table */
+ SWAPGS /* to kernel GS */
+ SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi /* to kernel CR3 */
+
...
ENTRY(nmi)
UNWIND_HINT_IRET_REGS
@@ -1446,6 +1476,7 @@ ENTRY(nmi)
swapgs
cld
/* NMI entry: switch to the kernel page table */
+ SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
movq %rsp, %rdx
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -1698,6 +1729,8 @@ end_repeat_nmi:
movq $-1, %rsi
call do_nmi
...
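The commit log notes that an NMI can interrupt any context, so the NMI path has to remember the CR3 value it found and put it back afterwards. The save/restore pair from calling.h can be modeled in C as below. This is a minimal sketch under stated assumptions: struct cpu_model and the function names are hypothetical, and the saved value is returned to the caller instead of being kept in a callee-saved register such as %r14.

```c
#include <stdint.h>

/* Assumption: the switch bit is bit 12, as defined in calling.h. */
#define PTI_SWITCH_MASK (UINT64_C(1) << 12)

/* Hypothetical model of the one piece of CPU state we care about. */
struct cpu_model {
	uint64_t cr3;
};

/*
 * Models SAVE_AND_SWITCH_TO_KERNEL_CR3: remember the incoming CR3 and
 * switch only if the user page table (bit 12 set) was active.
 */
static inline uint64_t save_and_switch_to_kernel_cr3(struct cpu_model *cpu)
{
	uint64_t saved = cpu->cr3;

	if (saved & PTI_SWITCH_MASK)
		cpu->cr3 = saved & ~PTI_SWITCH_MASK;

	return saved;
}

/* Models RESTORE_CR3: write the saved value back unconditionally. */
static inline void restore_cr3(struct cpu_model *cpu, uint64_t saved)
{
	cpu->cr3 = saved;
}
```

If the interrupted context was already on the kernel page table, the first function leaves CR3 alone, and the restore writes back the same value, so the pair is safe regardless of where the NMI arrived.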
pti.h
diff --git a/arch/x86/include/asm/pti.h b/arch/x86/include/asm/pti.h
new file mode 100644
index 0000000..0b5ef05
--- /dev/null
+++ b/arch/x86/include/asm/pti.h
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _ASM_X86_PTI_H
+#define _ASM_X86_PTI_H
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+extern void pti_init(void);
+extern void pti_check_boottime_disable(void);
+#else
+static inline void pti_check_boottime_disable(void) { }
+#endif
+
+#endif /* __ASSEMBLY__ */
+#endif /* _ASM_X86_PTI_H */
These are just function declarations and need no further explanation.
pti.c
Basic functions such as PTI initialization and parameter checking:
+#undef pr_fmt
+#define pr_fmt(fmt) "Kernel/User page tables isolation: " fmt
+// Print a message, but only if the CPU is affected
+static void __init pti_print_if_insecure(const char *reason)
+{
+ if (boot_cpu_has_bug(X86_BUG_CPU_INSECURE))
+ pr_info("%s\n", reason);
+}
+// Check the kernel boot parameters
+void __init pti_check_boottime_disable(void)
+{
+ if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
+ pti_print_if_insecure("disabled on XEN PV.");
+ return;
+ }
+
+ if (cmdline_find_option_bool(boot_command_line, "nopti")) {
+ pti_print_if_insecure("disabled on command line.");
+ return;
+ }
+
+ if (!boot_cpu_has_bug(X86_BUG_CPU_INSECURE))
+ return;
+
+ setup_force_cpu_cap(X86_FEATURE_PTI);
+}
+
+/*
+ * Initialize kernel page table isolation
+ */
// PTI initialization
+void __init pti_init(void)
+{
+ if (!static_cpu_has(X86_FEATURE_PTI))
+ return;
+
+ pr_info("enabled\n");
+}
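The code is again simple: pti_check_boottime_disable() leaves PTI off on Xen PV, when nopti is on the command line, or when the CPU does not have X86_BUG_CPU_INSECURE, and otherwise forces X86_FEATURE_PTI on. Many fix and optimization patches for PTI were submitted later, so pti.c in the latest kernel is far more complex than this version. Readers can look at it themselves, or trace the history of the file with the following command: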
#git log -p arch/x86/mm/pti.c
This article focuses on the basic PTI code, so it does not go any deeper here.
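The order of the checks in pti_check_boottime_disable() can be summarized as a small decision function. The following C sketch is a model, not kernel code: the enum, the function name pti_decide, and the three boolean inputs are hypothetical stand-ins for hypervisor_is_type(X86_HYPER_XEN_PV), the nopti lookup, and boot_cpu_has_bug(X86_BUG_CPU_INSECURE).

```c
#include <stdbool.h>

/* Hypothetical outcome labels for the boot-time decision. */
enum pti_decision {
	PTI_DISABLED_XEN_PV,   /* running as a Xen PV guest */
	PTI_DISABLED_CMDLINE,  /* "nopti" given on the command line */
	PTI_NOT_NEEDED,        /* CPU does not have the bug */
	PTI_ENABLED            /* X86_FEATURE_PTI would be forced on */
};

/* Mirrors the order of the early returns in pti_check_boottime_disable(). */
static inline enum pti_decision pti_decide(bool xen_pv, bool nopti,
					   bool cpu_insecure)
{
	if (xen_pv)
		return PTI_DISABLED_XEN_PV;
	if (nopti)
		return PTI_DISABLED_CMDLINE;
	if (!cpu_insecure)
		return PTI_NOT_NEEDED;
	return PTI_ENABLED;
}
```

The ordering matters for the log output: the Xen PV and nopti cases are checked before the CPU bug, which is why pti_print_if_insecure() has to test the bug flag itself before printing.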
init.c
+#include <asm/pti.h>
/*
* We need to define the tracepoints somewhere, and tlb.c
@@ -630,6 +631,7 @@ void __init init_mem_mapping(void)
{
unsigned long end;
// During kernel initialization, check the kernel boot parameters.
+ pti_check_boottime_disable();
probe_page_size_mask();
setup_pcid();
main.c
diff --git a/init/main.c b/init/main.c
index 8a390f6..b32ec72 100644
--- a/init/main.c
+++ b/init/main.c
@@ -75,6 +75,7 @@
#include <linux/slab.h>
#include <linux/perf_event.h>
#include <linux/ptrace.h>
+#include <linux/pti.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>
#include <linux/sched_clock.h>
@@ -506,6 +507,8 @@ static void __init mm_init(void)
ioremap_huge_init();
/* Should be run before the first non-init thread is created */
init_espfix_bsp();
// PTI initialization
+ /* Should be run after espfix64 is set up. */
+ pti_init();
}
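This needs no explanation: pti_init() is simply called at the end of mm_init().
Overall, the PTI code is simple and direct. The many later fixes and optimizations are not complicated either and are left for readers to study on their own.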