Shared Libraries

We’ve talked a bit about what object files and executables look like, so what do shared libraries look like? I’m going to focus on ELF shared libraries as used in SVR4 (and GNU/Linux, etc.), as they are the most flexible shared library implementation and the one I know best.

Windows shared libraries, known as DLLs, are less flexible in that you have to compile code differently depending on whether it will go into a shared library or not. You also have to express symbol visibility in the source code. This is not inherently bad, and indeed ELF has picked up some of these ideas over time, but the ELF format makes more decisions at link time and is thus more powerful.

When the program linker creates a shared library, it does not yet know which virtual address that shared library will run at. In fact, in different processes, the same shared library will run at different addresses, depending on the decisions made by the dynamic linker. This means that shared library code must be position independent. More precisely, it must be position independent after the dynamic linker has finished loading it. It is always possible for the dynamic linker to convert any piece of code to run at any virtual address, given sufficient relocation information. However, the relocation computations must be performed every time the program starts, so the program will start more slowly. Therefore, any shared library system seeks to generate position independent code which requires a minimal number of relocations to be applied at runtime, while still running at close to the runtime efficiency of position dependent code.

An additional complexity is that ELF shared libraries were designed to be roughly equivalent to ordinary archives.
This means that by default the main executable may override symbols in the shared library, such that references in the shared library will call the definition in the executable, even if the shared library also defines that same symbol. For example, an executable may define its own version of malloc. The C library also defines malloc, and the C library contains code which calls malloc. If the executable defines malloc itself, it will override the function in the C library. When some other function in the C library calls malloc, it will call the definition in the executable, not the definition in the C library.

There are thus different requirements pulling in different directions for any specific ELF implementation. The right implementation choices will depend on the characteristics of the processor. That said, most, but not all, processors make fairly similar decisions. I will describe the common case here. An example of a processor which uses the common case is the i386; an example of a processor which makes some different decisions is the PowerPC.

In the common case, code may be compiled in two different modes. By default, code is position dependent. Putting position dependent code into a shared library will cause the program linker to generate a lot of relocation information, and cause the dynamic linker to do a lot of processing at runtime. Code may also be compiled in position independent mode, typically with the -fpic option. Position independent code is slightly slower when it calls a non-static function or refers to a global or static variable. However, it requires much less relocation information, and thus the dynamic linker will start the program faster.

Position independent code will call non-static functions via the Procedure Linkage Table or PLT. This PLT does not exist in .o files; it exists only in executables and shared libraries. In a .o file, use of the PLT is indicated by a special relocation. When the program linker processes such a relocation, it will create an entry in the PLT.
It will adjust the instruction such that it becomes a PC-relative call to the PLT entry. PC-relative calls are inherently position independent and thus do not require a relocation entry themselves. The program linker will create a relocation for the PLT entry which tells the dynamic linker which symbol is associated with that entry. This process reduces the number of dynamic relocations in the shared library from one per function call to one per function called.

Further, PLT entries are normally relocated lazily by the dynamic linker. On most ELF systems this laziness may be overridden by setting the LD_BIND_NOW environment variable when running the program. However, by default, the dynamic linker will not actually apply a relocation to the PLT until some code actually calls the function in question. This also speeds up startup time, in that many invocations of a program will not call every possible function. This is particularly true when considering the shared C library, which has many more function calls than any typical program will execute.
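To make the counting argument concrete, here is a small toy model in Python. It does no real ELF parsing; the list of call sites and the helper name relocs_processed are invented for illustration, and the bind_now flag only stands in for the effect of LD_BIND_NOW.

```python
# Toy model: counts only, no real ELF parsing.
# The call sites below are hypothetical; each entry names the callee of one call.
call_sites = ["malloc", "free", "malloc", "printf", "malloc", "free"]

# Position dependent code: every call instruction needs its own dynamic relocation.
relocs_per_call = len(call_sites)

# Position independent code: one PLT entry, and so one relocation, per distinct callee.
plt_entries = sorted(set(call_sites))
relocs_per_function = len(plt_entries)


def relocs_processed(called_at_runtime, bind_now):
    """Number of PLT relocations the dynamic linker resolves during one run.

    bind_now=True models LD_BIND_NOW: everything is resolved at startup.
    Otherwise only functions that are actually called get resolved.
    """
    if bind_now:
        return len(plt_entries)
    return len(set(called_at_runtime) & set(plt_entries))


# A run that only ever calls malloc.
lazy_cost = relocs_processed(["malloc", "malloc"], bind_now=False)
eager_cost = relocs_processed(["malloc", "malloc"], bind_now=True)
```

In this sketch the six call sites collapse to three PLT relocations, and a run that only calls malloc resolves just one of them unless eager binding is requested.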
In order to make this work (that is, to defer each PLT relocation until the first call, so that rarely used functions such as error handlers cost nothing at startup), the program linker initializes the PLT entries to load an index into some register or push it on the stack, and then to branch to common code. The common code calls back into the dynamic linker, which uses the index to find the appropriate PLT relocation, and uses that to find the function being called. The dynamic linker then initializes the PLT entry with the address of the function, and then jumps to the code of the function. The next time the function is called, the PLT entry will branch directly to the function.

Before giving an example, I will talk about the other major data structure in position independent code, the Global Offset Table or GOT. This is used for global and static variables, and it also works together with the PLT for function calls. For every reference to a global variable from position independent code, the compiler will generate a load from the GOT to get the address of the variable, followed by a second load to get the actual value of the variable. The address of the GOT will normally be held in a register, permitting efficient access.
Like the PLT, the GOT does not exist in a .o file, but is created by the program linker. The program linker will create the dynamic relocations which the dynamic linker will use to initialize the GOT at runtime. Unlike the PLT, the dynamic linker always fully initializes the GOT entries used for variables when the program starts.

For example, on the i386, the address of the GOT is held in the register %ebx. This register is initialized at the entry to each function in position independent code. The initialization sequence varies from one compiler to another, but typically looks something like this:

    call __i686.get_pc_thunk.bx
    add $offset,%ebx

The function __i686.get_pc_thunk.bx simply looks like this:

    mov (%esp),%ebx
    ret

This sequence of instructions uses a position independent sequence to get the address at which it is running. Then it uses an offset to get the address of the GOT. Note that this requires that the GOT always be a fixed offset from the code, regardless of where the shared library is loaded. That is, the dynamic linker must load the shared library as a fixed unit; it may not load different parts at varying addresses.

Global and static variables are now read or written by first loading the address via a fixed offset from %ebx. The program linker will create dynamic relocations for each entry in the GOT, telling the dynamic linker how to initialize the entry. These relocations are of type GLOB_DAT.
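As a rough illustration of the two loads involved in reading a global variable, here is a toy model in Python rather than real i386 code. The Memory class, load_library, read_global, the slot numbers, and all addresses are made up for this sketch; only the idea that GLOB_DAT relocations fill GOT slots at startup, and that the GOT sits at a fixed offset from the library's load address, comes from the text.

```python
import struct

WORD = 4  # pointer size on the i386


class Memory:
    """Flat little-endian memory, just enough for this sketch."""

    def __init__(self, size):
        self.data = bytearray(size)

    def load(self, addr):
        return struct.unpack_from("<I", self.data, addr)[0]

    def store(self, addr, value):
        struct.pack_into("<I", self.data, addr, value)


def load_library(mem, load_addr, got_offset, glob_dat_relocs, symbol_addrs):
    """Model of startup: fill every GOT slot named by a GLOB_DAT relocation.

    glob_dat_relocs is a list of (slot_index, symbol_name) pairs.
    Returns the GOT address, which is what %ebx would hold.
    """
    got = load_addr + got_offset  # the GOT is at a fixed offset from the code
    for slot, name in glob_dat_relocs:
        mem.store(got + slot * WORD, symbol_addrs[name])
    return got


def read_global(mem, ebx, slot):
    var_addr = mem.load(ebx + slot * WORD)  # first load: address from the GOT
    return mem.load(var_addr)               # second load: the variable itself


mem = Memory(0x8000)
mem.store(0x7000, 42)            # hypothetical variable "counter" at 0x7000
relocs = [(3, "counter")]        # hypothetical GLOB_DAT relocation for slot 3
symbols = {"counter": 0x7000}

# The same library image loaded at two different addresses.
got_a = load_library(mem, 0x1000, 0x200, relocs, symbols)
got_b = load_library(mem, 0x4000, 0x200, relocs, symbols)
value_a = read_global(mem, got_a, 3)
value_b = read_global(mem, got_b, 3)
```

The code that reads the variable uses the same slot number at both load addresses; only the value held in the model's ebx differs, which is the point of going through the GOT.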
For function calls, the program linker will set up a PLT entry to look like this:

    jmp *offset(%ebx)
    pushl #index
    jmp first_plt_entry

The program linker will allocate an entry in the GOT for each entry in the PLT. It will create a dynamic relocation for the GOT entry of type JMP_SLOT. It will initialize the GOT entry to the base address of the shared library plus the address of the second instruction in the code sequence above. When the dynamic linker does the initial lazy binding on a JMP_SLOT reloc, it will simply add the difference between the shared library load address and the shared library base address to the GOT entry. The effect is that the first jmp instruction will jump to the second instruction, which will push the index entry and branch to the first PLT entry. The first PLT entry is special, and looks like this:

    pushl 4(%ebx)
    jmp *8(%ebx)

This references the second and third entries in the GOT. The dynamic linker will initialize them to have appropriate values for a callback into the dynamic linker itself. The dynamic linker will use the index pushed by the first code sequence to find the JMP_SLOT relocation. When the dynamic linker determines the function to be called, it will store the address of the function into the GOT entry referenced by the first code sequence. Thus, the next time the function is called, the jmp instruction will branch directly to the right code.

That was a fast pass over a lot of details, but I hope that it conveys the main idea. It means that for position independent code on the i386, every call to a global function requires one extra instruction after the first time it is called. Every reference to a global or static variable requires one extra instruction.
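The lazy binding sequence can be mimicked with a toy model in Python. This is only a sketch of the control flow, not of real machine code: the ToyProcess class and its method names are hypothetical, Python callables stand in for function addresses, and a closure stands in for the "pushl #index; jmp first_plt_entry" part of a PLT entry.

```python
class ToyProcess:
    """Toy model of lazily bound PLT entries that jump through GOT slots."""

    def __init__(self, jmp_slot_relocs, symbols):
        # jmp_slot_relocs[i] is the symbol named by the JMP_SLOT reloc for index i.
        self.relocs = list(jmp_slot_relocs)
        # name -> callable; a callable stands in for a function address.
        self.symbols = dict(symbols)
        self.resolutions = 0
        # Each GOT slot initially points back at the second instruction of its
        # PLT entry, modelled here as a stub that pushes the index.
        self.got = [self._make_stub(i) for i in range(len(self.relocs))]

    def _make_stub(self, index):
        def stub(*args):
            # pushl #index; jmp first_plt_entry -> callback into the dynamic linker
            return self._resolve(index, *args)
        return stub

    def _resolve(self, index, *args):
        # The dynamic linker uses the index to find the JMP_SLOT relocation,
        # looks up the symbol, patches the GOT slot, then jumps to the function.
        self.resolutions += 1
        target = self.symbols[self.relocs[index]]
        self.got[index] = target
        return target(*args)

    def call_plt(self, index, *args):
        # jmp *offset(%ebx): an indirect jump through the GOT slot
        return self.got[index](*args)


proc = ToyProcess(
    ["square", "negate"],
    {"square": lambda x: x * x, "negate": lambda x: -x},
)
first = proc.call_plt(0, 3)    # goes through the stub and the resolver
second = proc.call_plt(0, 4)   # goes straight to the function
```

After the first call the GOT slot for index 0 holds the function itself, so the resolver runs only once; the slot for index 1 is never touched because nothing calls it.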
Almost every function uses four extra instructions at entry to initialize %ebx (leaf functions which do not refer to any global variables do not need to initialize %ebx). This all has some negative impact on the program cache. This is the runtime performance penalty paid to let the dynamic linker start the program quickly.

On other processors, the details are naturally different. However, the general flavour is similar: position independent code in a shared library starts faster and runs slightly slower.

More tomorrow.