Now that we understand the basic concept of a process, let's look at how Linux actually implements one. According to standard operating-system theory, the process is the unit of resource allocation and the thread is the unit of execution; the kernel manages processes with a Process Control Block (PCB) and threads with a Thread Control Block (TCB). Does Linux really follow that model? Let's take a look.
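Before diving into the kernel source, it helps to see the question from user space. Linux does not keep separate PCB and TCB structures: fork() and pthread_create() both funnel into the clone() family of system calls, and the kernel builds a task_struct either way; the flags merely decide which resources the new task shares with its creator. Below is a minimal user-space sketch (not kernel code) using glibc's clone() wrapper; the 64 KiB stack and the CLONE_VM | CLONE_FS | CLONE_FILES flag set are illustrative choices, not anything the kernel mandates.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;

static int child_fn(void *arg)
{
    (void)arg;
    counter++;                      /* runs in the new task */
    return 0;
}

int main(void)
{
    char *stack = malloc(64 * 1024);
    if (!stack)
        return 1;

    /*
     * A "thread-like" child: CLONE_VM shares the address space, so the
     * increment is visible to the parent. Drop CLONE_VM and you get
     * fork()-like copy-on-write semantics, and the parent prints 0.
     */
    pid_t pid = clone(child_fn, stack + 64 * 1024,
                      CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, NULL);
    if (pid < 0) {
        perror("clone");
        return 1;
    }

    waitpid(pid, NULL, 0);
    printf("counter after CLONE_VM child: %d\n", counter);  /* 1 */

    free(stack);
    return 0;
}

With that picture in mind, the kernel-side entry point behind all of these calls is kernel_clone() in kernel/fork.c, which the next excerpt is taken from.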
    /*
     * For legacy clone() calls, CLONE_PIDFD uses the parent_tid argument
     * to return the pidfd. Hence, CLONE_PIDFD and CLONE_PARENT_SETTID are
     * mutually exclusive. With clone3() CLONE_PIDFD has grown a separate
     * field in struct clone_args and it still doesn't make sense to have
     * them both point at the same memory location. Performing this check
     * here has the advantage that we don't need to have a separate helper
     * to check for legacy clone().
     */
    if ((args->flags & CLONE_PIDFD) &&
        (args->flags & CLONE_PARENT_SETTID) &&
        (args->pidfd == args->parent_tid))
        return -EINVAL;

    /*
     * Determine whether and which event to report to ptracer. When
     * called from kernel_thread or CLONE_UNTRACED is explicitly
     * requested, no event is reported; otherwise, report if the event
     * for the type of forking is enabled.
     */
    if (!(clone_flags & CLONE_UNTRACED)) {
        if (clone_flags & CLONE_VFORK)
            trace = PTRACE_EVENT_VFORK;
        else if (args->exit_signal != SIGCHLD)
            trace = PTRACE_EVENT_CLONE;
        else
            trace = PTRACE_EVENT_FORK;

        if (likely(!ptrace_event_enabled(current, trace)))
            trace = 0;
    }

    p = copy_process(NULL, trace, NUMA_NO_NODE, args);
    add_latent_entropy();

    if (IS_ERR(p))
        return PTR_ERR(p);

    /*
     * Do this prior waking up the new thread - the thread pointer
     * might get invalid after that point, if the thread exits quickly.
     */
    trace_sched_process_fork(current, p);

    pid = get_task_pid(p, PIDTYPE_PID);
    nr = pid_vnr(pid);

    if (clone_flags & CLONE_PARENT_SETTID)
        put_user(nr, args->parent_tid);
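The CLONE_PIDFD / CLONE_PARENT_SETTID check above only matters for the legacy clone() entry point; with clone3() the pidfd gets its own field in struct clone_args. Here is a hedged user-space sketch of that newer interface. It assumes a kernel of roughly 5.4 or later plus glibc/kernel headers that provide SYS_clone3, struct clone_args and P_PIDFD, so treat it as illustrative rather than portable.

#define _GNU_SOURCE
#include <linux/sched.h>        /* struct clone_args, CLONE_PIDFD */
#include <poll.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int pidfd = -1;
    struct clone_args args;
    memset(&args, 0, sizeof(args));
    args.flags       = CLONE_PIDFD;          /* pidfd has its own field now */
    args.pidfd       = (uintptr_t)&pidfd;
    args.exit_signal = SIGCHLD;

    /* With no stack supplied, clone3() behaves like fork():
     * the child resumes here with a return value of 0. */
    long pid = syscall(SYS_clone3, &args, sizeof(args));
    if (pid < 0) {
        perror("clone3");
        return 1;
    }
    if (pid == 0) {                          /* child */
        sleep(1);
        _exit(42);
    }

    /* Parent: the pidfd becomes readable once the child exits. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    poll(&pfd, 1, -1);

    siginfo_t si;
    waitid(P_PIDFD, pidfd, &si, WEXITED);    /* reap via the pidfd */
    printf("child %ld exited with status %d\n", pid, si.si_status);
    close(pidfd);
    return 0;
}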
static int do_execveat_common(int fd, struct filename *filename,
                              struct user_arg_ptr argv,
                              struct user_arg_ptr envp,
                              int flags)
{
    struct linux_binprm *bprm;
    int retval;

    if (IS_ERR(filename))
        return PTR_ERR(filename);

    /*
     * We move the actual failure in case of RLIMIT_NPROC excess from
     * set*uid() to execve() because too many poorly written programs
     * don't check setuid() return code.  Here we additionally recheck
     * whether NPROC limit is still exceeded.
     */
    if ((current->flags & PF_NPROC_EXCEEDED) &&
        is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
        retval = -EAGAIN;
        goto out_ret;
    }

    /* We're below the limit (still or again), so we don't want to make
     * further execve() calls fail. */
    current->flags &= ~PF_NPROC_EXCEEDED;

    bprm->file = file;
    /*
     * Record that a name derived from an O_CLOEXEC fd will be
     * inaccessible after exec.  This allows the code in exec to
     * choose to fail when the executable is not mmaped into the
     * interpreter and an open file descriptor is not passed to
     * the interpreter.  This makes for a better user experience
     * than having the interpreter start and then immediately fail
     * when it finds the executable is inaccessible.
     */
    if (bprm->fdpath && get_close_on_exec(fd))
        bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;

    /* Set the unchanging part of bprm->cred */
    retval = security_bprm_creds_for_exec(bprm);
    if (retval)
        goto out;

    retval = exec_binprm(bprm);
    if (retval < 0)
        goto out;

out:
    /*
     * If past the point of no return ensure the code never
     * returns to the userspace process.  Use an existing fatal
     * signal if present otherwise terminate the process with
     * SIGSEGV.
     */
    if (bprm->point_of_no_return && !fatal_signal_pending(current))
        force_fatal_sig(SIGSEGV);
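User space reaches do_execveat_common() through execve(2), execveat(2) and glibc's fexecve(). Below is a small sketch of the fd-based variant, assuming glibc's fexecve(), which uses execveat(2) with AT_EMPTY_PATH when available. It also connects to the O_CLOEXEC comment above: for an ELF binary this works fine, but doing the same with a #! script through a close-on-exec descriptor can be rejected precisely because of BINPRM_FLAGS_PATH_INACCESSIBLE.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

extern char **environ;

int main(void)
{
    /* Open the program first, then exec the descriptor. */
    int fd = open("/bin/echo", O_RDONLY | O_CLOEXEC);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char *argv[] = { "echo", "hello from fexecve", NULL };
    fexecve(fd, argv, environ);

    perror("fexecve");          /* only reached if the exec failed */
    return 1;
}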
static int exec_binprm(struct linux_binprm *bprm)
{
    pid_t old_pid, old_vpid;
    int ret, depth;

    /* Need to fetch pid before load_binary changes it */
    old_pid = current->pid;
    rcu_read_lock();
    old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
    rcu_read_unlock();

    /* This allows 4 levels of binfmt rewrites before failing hard. */
    for (depth = 0;; depth++) {
        struct file *exec;
        if (depth > 5)
            return -ELOOP;

        ret = search_binary_handler(bprm);
        if (ret < 0)
            return ret;
        if (!bprm->interpreter)
            break;

    if (need_retry) {
        if (printable(bprm->buf[0]) && printable(bprm->buf[1]) &&
            printable(bprm->buf[2]) && printable(bprm->buf[3]))
            return retval;
        if (request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2)) < 0)
            return retval;
        need_retry = false;
        goto retry;
    }

    return retval;
}
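The loop in exec_binprm() is what makes interpreted programs work: binfmt_script recognizes the #! line, rewrites bprm to point at the interpreter, and the loop goes around again (up to the -ELOOP limit) until binfmt_elf finally loads a real ELF image; if nothing matches, search_binary_handler() may fall back to request_module("binfmt-...") as shown above. Here is a rough sketch that exercises that path; the /tmp/hello.sh path is just a hypothetical scratch file.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/hello.sh";       /* hypothetical scratch file */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0755);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    dprintf(fd, "#!/bin/sh\necho \"run via binfmt_script, then binfmt_elf\"\n");
    close(fd);

    pid_t pid = fork();
    if (pid == 0) {
        char *argv[] = { (char *)path, NULL };
        char *envp[] = { NULL };
        execve(path, argv, envp);             /* kernel rewrites this to /bin/sh */
        perror("execve");
        _exit(127);
    }
    waitpid(pid, NULL, 0);
    unlink(path);
    return 0;
}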
linux-src/fs/binfmt_elf.c
static int load_elf_binary(struct linux_binprm *bprm)
{
    struct file *interpreter = NULL; /* to shut gcc up */
    unsigned long load_addr = 0, load_bias = 0;
    int load_addr_set = 0;
    unsigned long error;
    struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
    struct elf_phdr *elf_property_phdata = NULL;
    unsigned long elf_bss, elf_brk;
    int bss_prot = 0;
    int retval, i;
    unsigned long elf_entry;
    unsigned long e_entry;
    unsigned long interp_load_addr = 0;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long reloc_func_desc __maybe_unused = 0;
    int executable_stack = EXSTACK_DEFAULT;
    struct elfhdr *elf_ex = (struct elfhdr *)bprm->buf;
    struct elfhdr *interp_elf_ex = NULL;
    struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
    struct mm_struct *mm;
    struct pt_regs *regs;

    retval = -ENOEXEC;
    /* First of all, some simple consistency checks */
    if (memcmp(elf_ex->e_ident, ELFMAG, SELFMAG) != 0)
        goto out;

    if (elf_ex->e_type != ET_EXEC && elf_ex->e_type != ET_DYN)
        goto out;
    if (!elf_check_arch(elf_ex))
        goto out;
    if (elf_check_fdpic(elf_ex))
        goto out;
    if (!bprm->file->f_op->mmap)
        goto out;

    elf_phdata = load_elf_phdrs(elf_ex, bprm->file);
    if (!elf_phdata)
        goto out;

    elf_ppnt = elf_phdata;
    for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) {
        char *elf_interpreter;

        if (elf_ppnt->p_type == PT_GNU_PROPERTY) {
            elf_property_phdata = elf_ppnt;
            continue;
        }

        if (elf_ppnt->p_type != PT_INTERP)
            continue;

        /*
         * This is the program interpreter used for shared libraries -
         * for now assume that this is an a.out format binary.
         */
        retval = -ENOEXEC;
        if (elf_ppnt->p_filesz > PATH_MAX || elf_ppnt->p_filesz < 2)
            goto out_free_ph;
    elf_ppnt = elf_phdata;
    for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++)
        switch (elf_ppnt->p_type) {
        case PT_GNU_STACK:
            if (elf_ppnt->p_flags & PF_X)
                executable_stack = EXSTACK_ENABLE_X;
            else
                executable_stack = EXSTACK_DISABLE_X;
            break;

        case PT_LOPROC ... PT_HIPROC:
            retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
                                      bprm->file, false,
                                      &arch_state);
            if (retval)
                goto out_free_dentry;
            break;
        }
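Everything load_elf_binary() reads here — the ELF header, the program header table, PT_INTERP, PT_GNU_STACK — is visible from user space too. The sketch below walks the same structures with <elf.h>; it assumes a 64-bit ELF file and is a simplified reader, not the kernel's algorithm.

#include <elf.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <elf-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    Elf64_Ehdr eh;
    if (pread(fd, &eh, sizeof(eh), 0) != sizeof(eh) ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) {
        fprintf(stderr, "not an ELF file\n");
        return 1;
    }
    printf("e_type=%u (2=ET_EXEC, 3=ET_DYN), e_phnum=%u\n", eh.e_type, eh.e_phnum);

    for (int i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        pread(fd, &ph, sizeof(ph), eh.e_phoff + i * sizeof(ph));
        printf("phdr %2d: p_type=0x%-8x p_vaddr=0x%-10lx p_memsz=0x%lx\n",
               i, ph.p_type, (unsigned long)ph.p_vaddr, (unsigned long)ph.p_memsz);

        if (ph.p_type == PT_INTERP) {          /* the header the kernel scans for */
            char interp[256] = "";
            pread(fd, interp, sizeof(interp) - 1, ph.p_offset);
            printf("         PT_INTERP = %s\n", interp);
        }
    }
    close(fd);
    return 0;
}

Running it on a typical dynamically linked program shows PT_INTERP pointing at the dynamic loader (for example /lib64/ld-linux-x86-64.so.2), which is exactly the interpreter load_elf_binary() goes on to validate and map below.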
    /* Some simple consistency checks for the interpreter */
    if (interpreter) {
        retval = -ELIBBAD;
        /* Not an ELF interpreter */
        if (memcmp(interp_elf_ex->e_ident, ELFMAG, SELFMAG) != 0)
            goto out_free_dentry;
        /* Verify the interpreter has a valid arch */
        if (!elf_check_arch(interp_elf_ex) ||
            elf_check_fdpic(interp_elf_ex))
            goto out_free_dentry;

        /* Load the interpreter program headers */
        interp_elf_phdata = load_elf_phdrs(interp_elf_ex, interpreter);
        if (!interp_elf_phdata)
            goto out_free_dentry;

        /* Pass PT_LOPROC..PT_HIPROC headers to arch code */
        elf_property_phdata = NULL;
        elf_ppnt = interp_elf_phdata;
        for (i = 0; i < interp_elf_ex->e_phnum; i++, elf_ppnt++)
            switch (elf_ppnt->p_type) {
            case PT_GNU_PROPERTY:
                elf_property_phdata = elf_ppnt;
                break;

            case PT_LOPROC ... PT_HIPROC:
                retval = arch_elf_pt_proc(interp_elf_ex,
                                          elf_ppnt, interpreter,
                                          true, &arch_state);
                if (retval)
                    goto out_free_dentry;
                break;
            }
    }

    /*
     * Allow arch code to reject the ELF at this point, whilst it's
     * still possible to return an error to the code that invoked
     * the exec syscall.
     */
    retval = arch_check_elf(elf_ex,
                            !!interpreter, interp_elf_ex,
                            &arch_state);
    if (retval)
        goto out_free_dentry;
    /* Flush all traces of the currently running executable */
    retval = begin_new_exec(bprm);
    if (retval)
        goto out_free_dentry;

    /* Do this immediately, since STACK_TOP as used in setup_arg_pages
       may depend on the personality.  */
    SET_PERSONALITY2(*elf_ex, &arch_state);
    if (elf_read_implies_exec(*elf_ex, executable_stack))
        current->personality |= READ_IMPLIES_EXEC;

    if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
        current->flags |= PF_RANDOMIZE;

    setup_new_exec(bprm);

    /* Do this so that we can load the interpreter, if need be.  We will
       change some of these later */
    retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
                             executable_stack);
    if (retval < 0)
        goto out_free_dentry;
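PF_RANDOMIZE and randomize_stack_top() are where address-space layout randomization enters the picture. A quick way to see the effect (hedged: the exact behavior depends on the architecture and on /proc/sys/kernel/randomize_va_space) is to print a few addresses and run the program several times:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    int on_stack;
    void *on_heap = malloc(16);
    void *mapped  = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* With randomization on, these change every run; with
     * randomize_va_space=0 (or `setarch -R`) they stay fixed. */
    printf("code  (main): %p\n", (void *)main);
    printf("stack       : %p\n", (void *)&on_stack);
    printf("heap        : %p\n", on_heap);
    printf("mmap        : %p\n", mapped);

    free(on_heap);
    munmap(mapped, 4096);
    return 0;
}

For a PIE executable (ET_DYN with an interpreter) even the code address moves between runs, which is exactly the load_bias logic in the next excerpt.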
    /* Now we do a little grungy work by mmapping the ELF image into
       the correct location in memory. */
    for (i = 0, elf_ppnt = elf_phdata;
         i < elf_ex->e_phnum; i++, elf_ppnt++) {
        int elf_prot, elf_flags;
        unsigned long k, vaddr;
        unsigned long total_size = 0;
        unsigned long alignment;

        if (elf_ppnt->p_type != PT_LOAD)
            continue;

        if (unlikely(elf_brk > elf_bss)) {
            unsigned long nbyte;

            /* There was a PT_LOAD segment with p_memsz > p_filesz
               before this one. Map anonymous pages, if needed,
               and clear the area.  */
            retval = set_brk(elf_bss + load_bias,
                             elf_brk + load_bias,
                             bss_prot);
            if (retval)
                goto out_free_dentry;
            nbyte = ELF_PAGEOFFSET(elf_bss);
            if (nbyte) {
                nbyte = ELF_MIN_ALIGN - nbyte;
                if (nbyte > elf_brk - elf_bss)
                    nbyte = elf_brk - elf_bss;
                if (clear_user((void __user *)elf_bss + load_bias, nbyte)) {
                    /*
                     * This bss-zeroing can fail if the ELF
                     * file specifies odd protections. So
                     * we don't check the return value
                     */
                }
            }
        }

        vaddr = elf_ppnt->p_vaddr;
        /*
         * If we are loading ET_EXEC or we have already performed
         * the ET_DYN load_addr calculations, proceed normally.
         */
        if (elf_ex->e_type == ET_EXEC || load_addr_set) {
            elf_flags |= MAP_FIXED;
        } else if (elf_ex->e_type == ET_DYN) {
            /*
             * This logic is run once for the first LOAD Program
             * Header for ET_DYN binaries to calculate the
             * randomization (load_bias) for all the LOAD
             * Program Headers, and to calculate the entire
             * size of the ELF mapping (total_size). (Note that
             * load_addr_set is set to true later once the
             * initial mapping is performed.)
             *
             * There are effectively two types of ET_DYN
             * binaries: programs (i.e. PIE: ET_DYN with INTERP)
             * and loaders (ET_DYN without INTERP, since they
             * _are_ the ELF interpreter). The loaders must
             * be loaded away from programs since the program
             * may otherwise collide with the loader (especially
             * for ET_EXEC which does not have a randomized
             * position). For example to handle invocations of
             * "./ld.so someprog" to test out a new version of
             * the loader, the subsequent program that the
             * loader loads must avoid the loader itself, so
             * they cannot share the same load range. Sufficient
             * room for the brk must be allocated with the
             * loader as well, since brk must be available with
             * the loader.
             *
             * Therefore, programs are loaded offset from
             * ELF_ET_DYN_BASE and loaders are loaded into the
             * independently randomized mmap region (0 load_bias
             * without MAP_FIXED).
             */
            if (interpreter) {
                load_bias = ELF_ET_DYN_BASE;
                if (current->flags & PF_RANDOMIZE)
                    load_bias += arch_mmap_rnd();
                alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
                if (alignment)
                    load_bias &= ~(alignment - 1);
                elf_flags |= MAP_FIXED;
            } else
                load_bias = 0;

            /*
             * Since load_bias is used for all subsequent loading
             * calculations, we must lower it by the first vaddr
             * so that the remaining calculations based on the
             * ELF vaddrs will be correctly offset. The result
             * is then page aligned.
             */
            load_bias = ELF_PAGESTART(load_bias - vaddr);
        if (!load_addr_set) {
            load_addr_set = 1;
            load_addr = (elf_ppnt->p_vaddr - elf_ppnt->p_offset);
            if (elf_ex->e_type == ET_DYN) {
                load_bias += error -
                             ELF_PAGESTART(load_bias + vaddr);
                load_addr += load_bias;
                reloc_func_desc = load_bias;
            }
        }
        k = elf_ppnt->p_vaddr;
        if ((elf_ppnt->p_flags & PF_X) && k < start_code)
            start_code = k;
        if (start_data < k)
            start_data = k;

        /*
         * Check to see if the section's size will overflow the
         * allowed task size. Note that p_filesz must always be
         * <= p_memsz so it is only necessary to check p_memsz.
         */
        if (BAD_ADDR(k) || elf_ppnt->p_filesz > elf_ppnt->p_memsz ||
            elf_ppnt->p_memsz > TASK_SIZE ||
            TASK_SIZE - elf_ppnt->p_memsz < k) {
            /* set_brk can never work. Avoid overflows. */
            retval = -EINVAL;
            goto out_free_dentry;
        }

        k = elf_ppnt->p_vaddr + elf_ppnt->p_filesz;

        if (k > elf_bss)
            elf_bss = k;
        if ((elf_ppnt->p_flags & PF_X) && end_code < k)
            end_code = k;
        if (end_data < k)
            end_data = k;
        k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;
        if (k > elf_brk) {
            bss_prot = elf_prot;
            elf_brk = k;
        }
    }
    /* Calling set_brk effectively mmaps the pages that we need
     * for the bss and break sections.  We must do this before
     * mapping in the interpreter, to make sure it doesn't wind
     * up getting placed where the bss needs to go.
     */
    retval = set_brk(elf_bss, elf_brk, bss_prot);
    if (retval)
        goto out_free_dentry;
    if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
        retval = -EFAULT; /* Nobody gets to see this, but.. */
        goto out_free_dentry;
    }

    if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) {
        /*
         * For architectures with ELF randomization, when executing
         * a loader directly (i.e. no interpreter listed in ELF
         * headers), move the brk area out of the mmap region
         * (since it grows up, and may collide early with the stack
         * growing down), and into the unused ELF_ET_DYN_BASE region.
         */
        if (IS_ENABLED(CONFIG_ARCH_HAS_ELF_RANDOMIZE) &&
            elf_ex->e_type == ET_DYN && !interpreter) {
            mm->brk = mm->start_brk = ELF_ET_DYN_BASE;
        }

    if (current->personality & MMAP_PAGE_ZERO) {
        /* Why this, you ask???  Well SVr4 maps page 0 as read-only,
           and some applications "depend" upon this behavior.
           Since we do not have the power to recompile these, we
           emulate the SVr4 behavior. Sigh. */
        error = vm_mmap(NULL, 0, PAGE_SIZE, PROT_READ | PROT_EXEC,
                        MAP_FIXED | MAP_PRIVATE, 0);
    }

    regs = current_pt_regs();
#ifdef ELF_PLAT_INIT
    /*
     * The ABI may specify that certain registers be set up in special
     * ways (on i386 %edx is the address of a DT_FINI function, for
     * example.  In addition, it may also specify (eg, PowerPC64 ELF)
     * that the e_entry field is the address of the function descriptor
     * for the startup routine, rather than the address of the startup
     * routine itself.  This macro performs whatever initialization to
     * the regs structure is required as well as any relocations to the
     * function descriptor entries when executing dynamically links apps.
     */
    ELF_PLAT_INIT(regs, reloc_func_desc);
#endif
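After load_elf_binary() returns and the process starts running, the result of all this mapping work is directly observable in /proc/self/maps: the PT_LOAD segments of the executable, the heap placed by the brk setup above, the interpreter (ld.so) and the libraries it mapped, the stack, and the vdso. A trivial dumper:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char line[512];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);        /* executable, heap, libs, stack, vdso... */

    fclose(f);
    return 0;
}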
    spin_lock_irq(&sighand->siglock);
    if (signal_group_exit(sig))
        /* Another thread got here before we took the lock.  */
        exit_code = sig->group_exit_code;
    else {
        sig->group_exit_code = exit_code;
        sig->flags = SIGNAL_GROUP_EXIT;
        zap_other_threads(current);
    }
    spin_unlock_irq(&sighand->siglock);
}
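This SIGNAL_GROUP_EXIT / zap_other_threads() path is what distinguishes exit_group(2) — which glibc's exit() and returning from main() use — from the raw exit(2) syscall, which terminates only the calling thread. A small demonstration (build with gcc -pthread; the sleeps are only there to make the ordering visible):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 3; i++) {
        printf("worker still alive (%d)\n", i);
        sleep(1);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    sleep(1);

    /*
     * exit(0) here would call exit_group(2): the kernel would set
     * SIGNAL_GROUP_EXIT and zap_other_threads() would kill the worker
     * immediately. Raw exit(2) ends only this thread, so the worker
     * keeps printing until it finishes.
     */
    syscall(SYS_exit, 0);
    return 0;                       /* never reached */
}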
    /*
     * We can get here from a kernel oops, sometimes with preemption off.
     * Start by checking for critical errors.
     * Then fix up important state like USER_DS and preemption.
     * Then do everything else.
     */

    WARN_ON(blk_needs_flush_plug(tsk));

    if (unlikely(in_interrupt()))
        panic("Aiee, killing interrupt handler!");
    if (unlikely(!tsk->pid))
        panic("Attempted to kill the idle task!");

    /*
     * If do_exit is called because this processes oopsed, it's possible
     * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
     * continuing. Amongst other possible reasons, this is to prevent
     * mm_release()->clear_child_tid() from writing to a user-controlled
     * kernel address.
     */
    force_uaccess_begin();

    if (unlikely(in_atomic())) {
        pr_info("note: %s[%d] exited with preempt_count %d\n",
            current->comm, task_pid_nr(current),
            preempt_count());
        preempt_count_set(PREEMPT_ENABLED);
    }

    profile_task_exit(tsk);
    kcov_task_exit(tsk);
ptrace_event(PTRACE_EVENT_EXIT, code);
validate_creds_for_do_exit(tsk);
    /*
     * We're taking recursive faults here in do_exit. Safest is to just
     * leave this task alone and wait for reboot.
     */
    if (unlikely(tsk->flags & PF_EXITING)) {
        pr_alert("Fixing recursive fault but reboot is needed!\n");
        futex_exit_recursive(tsk);
        set_current_state(TASK_UNINTERRUPTIBLE);
        schedule();
    }

    /* sync mm's RSS info before statistics gathering */
    if (tsk->mm)
        sync_mm_rss(tsk->mm);
    acct_update_integrals(tsk);
    group_dead = atomic_dec_and_test(&tsk->signal->live);
    if (group_dead) {
        /*
         * If the last thread of global init has exited, panic
         * immediately to get a useable coredump.
         */
        if (unlikely(is_global_init(tsk)))
            panic("Attempted to kill init! exitcode=0x%08x\n",
                  tsk->signal->group_exit_code ?: (int)code);

#ifdef CONFIG_POSIX_TIMERS
        hrtimer_cancel(&tsk->signal->real_timer);
        exit_itimers(tsk->signal);
#endif
        if (tsk->mm)
            setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
    }
    acct_collect(code, group_dead);
    if (group_dead)
        tty_audit_exit();
    audit_free(tsk);

    if (group_dead)
        acct_process();
    trace_sched_process_exit(tsk);

    exit_sem(tsk);
    exit_shm(tsk);
    exit_files(tsk);
    exit_fs(tsk);
    if (group_dead)
        disassociate_ctty(1);
    exit_task_namespaces(tsk);
    exit_task_work(tsk);
    exit_thread(tsk);
    /*
     * Flush inherited counters to the parent - before the parent
     * gets woken up by child-exit notifications.
     *
     * because of cgroup mode, must be called before cgroup_exit()
     */
    perf_event_exit_task(tsk);

    sched_autogroup_exit_task(tsk);
    cgroup_exit(tsk);

    /*
     * FIXME: do that only when needed, using sched_exit tracepoint
     */
    flush_ptrace_hw_breakpoint(tsk);

    exit_tasks_rcu_start();
    exit_notify(tsk, group_dead);
    proc_exit_connector(tsk);
    mpol_put_task_policy(tsk);
#ifdef CONFIG_FUTEX
    if (unlikely(current->pi_state_cache))
        kfree(current->pi_state_cache);
#endif
    /*
     * Make sure we are holding no locks:
     */
    debug_check_no_locks_held();
    if (tsk->io_context)
        exit_io_context(tsk);

    if (tsk->splice_pipe)
        free_pipe_info(tsk->splice_pipe);

    if (tsk->task_frag.page)
        put_page(tsk->task_frag.page);
validate_creds_for_do_exit(tsk);
    check_stack_usage();
    preempt_disable();
    if (tsk->nr_dirtied)
        __this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
    exit_rcu();
    exit_tasks_rcu_finish();
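Note that do_exit() does not free the task completely: exit_notify() above turns the task into a zombie and signals the parent, and the exit code survives until someone reaps it. The classic user-space counterpart looks like this:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(7);       /* child goes straight into do_exit() */

    sleep(2);           /* child now shows up as "Z" (zombie) in ps
                         * until the parent collects its status */

    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("child %d exited with code %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}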