1: Preface
Reading and writing files is the most central, and also the most complex, part of a file system, and it involves a great many concepts. When we analyzed other file-system operations earlier, the parts related to reading and writing were skipped over. In this section we discuss how file reads and writes are actually implemented.
2: Overview of I/O requests
As mentioned before, to make file operations efficient, file-system contents are cached in memory. Whenever a Read/Write request is issued, the page cache is searched for the page in question. If the page does not exist, a cache page is created for it in the page cache. If the page is not up to date, the data must be read from the underlying file system. In general, the kernel provides an interface for this: it generates an I/O request. This interface hides the differing lower-level implementations from the upper layers. Within it, the generated I/O request is handed to the I/O scheduler, and the I/O scheduler in turn invokes the concrete block device driver.
The overall flow is shown in the figure below:
The Generic Block Layer in the figure above is the I/O interface just described.
In what follows we discuss these layers from the bottom of the figure upward.
3: Block device drivers
Block devices differ from character devices in that a block device can be accessed randomly, a disk being the typical example. Precisely because access is random, the kernel needs an efficient way to manage each block device. For a disk, every movement of the head costs a noticeable amount of time, so the driver should try to finish the requests on the current track before moving the head to another track. Character devices raise no such concern: they are simply read or written sequentially.
Let us first look at the data structures a block device driver involves.
3.1: The block_device structure:
struct block_device {
//major/minor device number
dev_t bd_dev; /* not a kdev_t - it's a search key */
//points to the inode of this block device in the bdev filesystem
struct inode * bd_inode; /* will die */
//counter: how many times the block device has been opened
int bd_openers;
//semaphore protecting open and close of the block device
struct semaphore bd_sem; /* open/close mutex */
//semaphore forbidding new mounts on the block device
struct semaphore bd_mount_sem; /* mount mutex */
//list of opened inodes of this block device
struct list_head bd_inodes;
//current holder of the block device descriptor
void * bd_holder;
//counter: number of times bd_holder has been claimed
int bd_holders;
//if this block device is a partition, points to the block device of the whole disk;
//otherwise points to this descriptor itself
struct block_device * bd_contains;
//block size
unsigned bd_block_size;
//pointer to the partition descriptor
struct hd_struct * bd_part;
/* number of times partitions within this device have been opened. */
//counter: how many times partitions of this device have been opened
unsigned bd_part_count;
//flag set to cause the partition table to be re-read
int bd_invalidated;
//points to the gendisk of the disk this block device belongs to
struct gendisk * bd_disk;
//links this descriptor into the list of block device descriptors
struct list_head bd_list;
//points to the device's dedicated backing_dev_info descriptor
struct backing_dev_info *bd_inode_backing_dev_info;
/*
 * Private data. You must have bd_claim'ed the block_device
 * to use this. NOTE: bd_claim allows an owner to claim
 * the same device multiple times, the owner must take special
 * care to not mess up bd_private for that case.
 */
//private area of the block device
unsigned long bd_private;
}
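The bd_holder and bd_holders fields above are managed through bd_claim() and bd_release(). As a minimal hedged sketch (the bdev and my_driver pointers are placeholders invented for this example), a kernel component that wants exclusive use of a device does roughly:

//claim the device: on success bd_holder is set to my_driver and bd_holders is incremented
if (bd_claim(bdev, my_driver) == 0) {
        /* ... use the device exclusively ... */
        //drop the claim; bd_holders is decremented
        bd_release(bdev);
}

A second bd_claim() with a different holder fails with -EBUSY; this is how, for example, a mounted filesystem fends off conflicting exclusive opens.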
通常,對于塊設(shè)備來說還涉及到一個分區(qū)問題.分區(qū)在內(nèi)核中是用hd_struct來表示的.
3.2: hd_struct結(jié)構(gòu):
struct hd_struct {
//磁盤分區(qū)的起始扇區(qū)
sector_t start_sect;
//分區(qū)的長度,即扇區(qū)的數(shù)目
sector_t nr_sects;
//內(nèi)嵌的kobject
struct kobject kobj;
//分區(qū)的讀操作次數(shù),讀取扇區(qū)數(shù),寫操作次數(shù),寫扇區(qū)數(shù)
unsigned reads, read_sectors, writes, write_sectors;
//policy:如果分區(qū)是只讀的,置為1.否則為0
//partno:磁盤中分區(qū)的相對索引
int policy, partno;
}
Every physical block device corresponds to a disk, and a disk is represented in the kernel by gendisk.
3.3: The gendisk structure:
struct gendisk {
//major number of the disk
int major; /* major number of driver */
//first minor number associated with the disk
int first_minor;
//range of minor numbers associated with the disk
int minors; /* maximum number of minors, =1 for
* disks that can't be partitioned. */
//name of the disk
char disk_name[32]; /* name of major driver */
//array of partition descriptors of the disk
struct hd_struct **part; /* [indexed by minor] */
//pointer to the block device operations
struct block_device_operations *fops;
//pointer to the disk's request queue
struct request_queue *queue;
//private area for the driver
void *private_data;
//capacity of the disk (number of sectors)
sector_t capacity;
//flags describing the kind of disk
int flags;
//name in the devfs filesystem
char devfs_name[64]; /* devfs crap */
//no longer used
int number; /* more of the same */
//points to the device object of the disk's hardware device
struct device *driverfs_dev;
//embedded kobject
struct kobject kobj;
//state used to feed disk interrupt timing into the entropy pool
struct timer_rand_state *random;
//1 if the disk is read-only, 0 otherwise
int policy;
//counter of sectors written to the disk
atomic_t sync_io; /* RAID */
//timestamps used for disk queue usage statistics
unsigned long stamp, stamp_idle;
//number of I/O operations in flight
int in_flight;
//per-CPU disk usage statistics
#ifdef CONFIG_SMP
struct disk_stats *dkstats;
#else
struct disk_stats dkstats;
#endif
}
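To see how these fields are filled in, here is a hedged sketch of the registration sequence a simple 2.6 driver typically follows (MY_MAJOR, my_fops, my_queue and nsectors are assumptions of this example, not names from the text):

struct gendisk *gd;

gd = alloc_disk(16);                    /* this disk plus up to 15 partitions (minors) */
if (!gd)
        goto fail;
gd->major = MY_MAJOR;                   /* major obtained earlier via register_blkdev() */
gd->first_minor = 0;
gd->fops = &my_fops;                    /* struct block_device_operations */
gd->queue = my_queue;                   /* request queue set up with blk_init_queue() */
sprintf(gd->disk_name, "myblk");
set_capacity(gd, nsectors);             /* fills in gendisk->capacity */
add_disk(gd);                           /* the disk goes live here */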
以上三個數(shù)據(jù)結(jié)構(gòu)的關(guān)系,如下圖所示:
如上圖所示:
每個塊設(shè)備分區(qū)的bd_contains會指它的總塊設(shè)備節(jié)點,它的bd_part會指向它的分區(qū)表.bd_disk會指向它所屬的磁盤.
從上圖中也可以看出:每個磁盤都會對應一個request_queue.對于上層的I/O請求就是通過它來完成的了.它的結(jié)構(gòu)如下:
3.4: The request_queue structure:
struct request_queue
{
/*
 * Together with queue_head for cacheline sharing
 */
//list of pending requests
struct list_head queue_head;
//points to the request most likely to be merged with first
struct request *last_merge;
//pointer to the I/O scheduler (elevator)
elevator_t elevator;
/*
 * the queue request freelist, one for reads and one for writes
 */
//data structure used to allocate request descriptors
struct request_list rq;
//entry point of the driver's strategy routine
request_fn_proc *request_fn;
//method that checks whether a bio can be merged at the back of an existing request
merge_request_fn *back_merge_fn;
//method that checks whether a bio can be merged at the front of an existing request
merge_request_fn *front_merge_fn;
//method that tries to merge two adjacent requests
merge_requests_fn *merge_requests_fn;
//method called to insert a new request into the request queue
make_request_fn *make_request_fn;
//method that prepares the command for a request to be sent to the hardware device
prep_rq_fn *prep_rq_fn;
//method that unplugs the block device
unplug_fn *unplug_fn;
//when a new segment is added, this method returns the number of bytes that may
//still be inserted into an existing bio
merge_bvec_fn *merge_bvec_fn;
//method called when a request is added to the request queue
activity_fn *activity_fn;
//method called to flush the request queue
issue_flush_fn *issue_flush_fn;
/*
 * Auto-unplugging state
 */
//timer used while the device is plugged
struct timer_list unplug_timer;
//if the number of pending requests exceeds this value, the device is unplugged immediately
int unplug_thresh; /* After this many requests */
//delay before the device is unplugged
unsigned long unplug_delay; /* After this many jiffies */
//work item used to unplug the device
struct work_struct unplug_work;
//embedded backing_dev_info
struct backing_dev_info backing_dev_info;
/*
 * The queue owner gets to use this for whatever they like.
 * ll_rw_blk doesn't touch it.
 */
//private data of the block device driver
void *queuedata;
//argument passed to activity_fn()
void *activity_data;
/*
 * queue needs bounce pages for pages above this limit
 */
//pages with a frame number above this value go through the bounce buffer
unsigned long bounce_pfn;
//allocation flags for bounce buffer pages
int bounce_gfp;
/*
 * various queue flags, see QUEUE_* below
 */
//flags describing the request queue
unsigned long queue_flags;
/*
 * protects queue structures from reentrancy
 */
//pointer to the request queue lock
spinlock_t *queue_lock;
/*
 * queue kobject
 */
//embedded kobject
struct kobject kobj;
/*
 * queue settings
 */
//maximum number of requests allowed in the queue
unsigned long nr_requests; /* Max # of requests */
//if the number of pending requests exceeds this value, the queue is considered congested
unsigned int nr_congestion_on;
//if the number of pending requests falls below this threshold, the queue is considered uncongested
unsigned int nr_congestion_off;
//maximum number of sectors a single request may handle (tunable)
unsigned short max_sectors;
//maximum number of sectors a single request may handle (hard constraint)
unsigned short max_hw_sectors;
//maximum number of physical segments a single request may handle
unsigned short max_phys_segments;
//maximum number of hardware segments a single request may handle (DMA constraint)
unsigned short max_hw_segments;
//sector size in bytes
unsigned short hardsect_size;
//maximum length of a physical segment, in bytes
unsigned int max_segment_size;
//memory boundary mask for segment merging
unsigned long seg_boundary_mask;
//alignment of the start address and length of DMA buffers
unsigned int dma_alignment;
//bitmap of free/busy tags, used for tagged requests
struct blk_queue_tag *queue_tags;
//reference count of the request queue
atomic_t refcnt;
//number of requests being processed in the queue
unsigned int in_flight;
/*
 * sg stuff
 */
//user-defined command timeout
unsigned int sg_timeout;
//not used
unsigned int sg_reserved_size;
}
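The "queue settings" fields at the end are normally not written directly; drivers set them through helper functions. A hedged sketch of how a driver might constrain its queue (the numeric limits here are arbitrary examples):

blk_queue_max_sectors(q, 128);          /* sets q->max_sectors */
blk_queue_max_phys_segments(q, 32);     /* sets q->max_phys_segments */
blk_queue_max_hw_segments(q, 32);       /* sets q->max_hw_segments */
blk_queue_max_segment_size(q, 65536);   /* sets q->max_segment_size */
blk_queue_hardsect_size(q, 512);        /* sets q->hardsect_size */
blk_queue_segment_boundary(q, 0xffff);  /* no segment may cross a 64KB boundary */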
request_queue represents a request queue; each individual request in it is represented by a request structure.
3.5: The request structure:
struct request {
//links the request into lists
struct list_head queuelist; /* looking for ->queue? you must _not_
* access it directly, use
* blkdev_dequeue_request! */
//flags of the request descriptor
unsigned long flags; /* see REQ_ bits below */
/* Maintain bio traversal state for part by part I/O submission.
 * hard_* are block layer internals, no driver should touch them!
 */
//next sector to submit
sector_t sector; /* next sector to submit */
//number of sectors left to submit
unsigned long nr_sectors; /* no. of sectors left to submit */
/* no. of sectors left to submit in the current segment */
//number of sectors to transfer in the current bio segment
unsigned int current_nr_sectors;
//next sector to complete
sector_t hard_sector; /* next sector to complete */
//number of sectors left to complete in the whole request
unsigned long hard_nr_sectors; /* no. of sectors left to complete */
/* no. of sectors left to complete in the current segment */
//number of sectors left to complete in the current bio segment
unsigned int hard_cur_sectors;
/* no. of segments left to submit in the current bio */
unsigned short nr_cbio_segments;
/* no. of sectors left to submit in the current bio */
unsigned long nr_cbio_sectors;
struct bio *cbio; /* next bio to submit */
//first unfinished bio in the request
struct bio *bio; /* next unfinished bio to complete */
//last bio
struct bio *biotail;
//points to the I/O scheduler's private area
void *elevator_private;
//status of the request
int rq_status; /* should split this into a few status bits */
//disk descriptor the request refers to
struct gendisk *rq_disk;
//count of transfer failures
int errors;
//time the request started
unsigned long start_time;
/* Number of scatter-gather DMA addr+len pairs after
 * physical address coalescing is performed.
 */
//number of physical segments of the request
unsigned short nr_phys_segments;
/* Number of scatter-gather addr+len pairs after
 * physical and DMA remapping hardware coalescing is performed.
 * This is the number of scatter-gather entries the driver
 * will actually have to deal with after DMA mapping is done.
 */
//number of hardware segments of the request
unsigned short nr_hw_segments;
//tag associated with the request
int tag;
//data transfer buffer; NULL if the buffer is in high memory
char *buffer;
//reference count of the request
int ref_count;
//request queue descriptor the request belongs to
request_queue_t *q;
struct request_list *rl;
//completion used to signal the end of the data transfer
struct completion *waiting;
//pointer used when sending a "special" request to the device
void *special;
/*
 * when request is used as a packet command carrier
 */
//length of the data in cmd
unsigned int cmd_len;
//the command of the request
unsigned char cmd[BLK_MAX_CDB];
//length of the data in data
unsigned int data_len;
//pointer used to track the data being transferred
void *data;
//length of the data in the sense field
unsigned int sense_len;
//points to the output sense buffer
void *sense;
//request timeout
unsigned int timeout;
/*
 * For Power Management requests
 */
//points to the structure used by power-management commands
struct request_pm_state *pm;
}
請求隊列描述符與請求描述符都很復雜,為了簡化驅(qū)動的設(shè)計,內(nèi)核提供了一個API,供塊設(shè)備驅(qū)動程序來初始化一個請求隊列.這就是blk_init_queue().它的代碼如下:
//rfn:驅(qū)動程序自動提供的操作I/O的函數(shù).對應請求隊列的request_fn
//lock:驅(qū)動程序提供給請求隊列的自旋鎖
request_queue_t *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
{
request_queue_t *q;
static int printed;
//allocate the request queue descriptor
q = blk_alloc_queue(GFP_KERNEL);
if (!q)
return NULL;
//initialize q->rq, the request free list
if (blk_init_free_list(q))
goto out_init;
if (!printed) {
printed = 1;
printk("Using %s io scheduler\n", chosen_elevator->elevator_name);
}
//initialize the operation methods of the queue descriptor
q->request_fn = rfn;
q->back_merge_fn = ll_back_merge_fn;
q->front_merge_fn = ll_front_merge_fn;
q->merge_requests_fn = ll_merge_requests_fn;
q->prep_rq_fn = NULL;
q->unplug_fn = generic_unplug_device;
q->queue_flags = (1 << QUEUE_FLAG_CLUSTER);
q->queue_lock = lock;
blk_queue_segment_boundary(q, 0xffffffff);
//set q->make_request_fn; also initialize the queue's unplug timer and work item
blk_queue_make_request(q, __make_request);
//set max_segment_size, max_hw_segments and max_phys_segments
blk_queue_max_segment_size(q, MAX_SEGMENT_SIZE);
blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
/*
 * all done
 */
//attach the I/O scheduler to the queue
if (!elevator_init(q, chosen_elevator))
return q;
//failure path
blk_cleanup_queue(q);
out_init:
kmem_cache_free(requestq_cachep, q);
return NULL;
}
This function initializes many operation pointers, and it is the same for every block device, which gives the generic block layer a uniform interface. The driver-specific part of that interface is the strategy routine passed into blk_init_queue(). Take note of how the queue's operations are set up here; we will use them in the analysis that follows.
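As a hedged illustration of how a simple driver uses this API (modeled on the classic RAM-disk style examples; my_request_fn, my_lock and my_queue are names invented here), the strategy routine typically drains the queue with elv_next_request() and completes each request with end_request():

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

//strategy routine: becomes q->request_fn and is called with q->queue_lock held
static void my_request_fn(request_queue_t *q)
{
        struct request *req;

        while ((req = elv_next_request(q)) != NULL) {
                if (!blk_fs_request(req)) {     /* not a filesystem request */
                        end_request(req, 0);    /* fail it */
                        continue;
                }
                /* transfer req->current_nr_sectors sectors starting at
                 * req->sector to or from req->buffer here ... */
                end_request(req, 1);            /* report success */
        }
}

//in the driver's init path:
//my_queue = blk_init_queue(my_request_fn, &my_lock);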
In addition, the request structure refers to the bio structure. A bio describes one block I/O operation, built out of segments; at present all I/O operations in the kernel are represented this way. Its structure is as follows:
struct bio {
//first sector of the I/O
sector_t bi_sector;
//next bio
struct bio *bi_next; /* request queue link */
//block device the bio is for
struct block_device *bi_bdev;
//flags of the bio
unsigned long bi_flags; /* status, command, etc */
//Read/Write
unsigned long bi_rw; /* bottom bits READ/WRITE,
* top bits priority
*/
//number of bio_vec entries
unsigned short bi_vcnt; /* how many bio_vec's */
//bio_vec entry currently being operated on
unsigned short bi_idx; /* current index into bvl_vec */
/* Number of segments in this BIO after
 * physical address coalescing is performed.
 */
//number of segments after physical-address coalescing
unsigned short bi_phys_segments;
/* Number of segments after physical and DMA remapping
 * hardware coalescing is performed.
 */
//number of segments after DMA remapping
unsigned short bi_hw_segments;
//residual I/O count
unsigned int bi_size; /* residual I/O count */
/*
 * To keep track of the max hw size, we account for the
 * sizes of the first and last virtually mergeable segments
 * in this bio
 */
//size of the first mergeable segment
unsigned int bi_hw_front_size;
//size of the last mergeable segment
unsigned int bi_hw_back_size;
//maximum number of bio_vec entries
unsigned int bi_max_vecs; /* max bvl_vecs we can hold */
//the bi_io_vec array
struct bio_vec *bi_io_vec; /* the actual vec list */
//method called when the I/O completes
bio_end_io_t *bi_end_io;
//usage count
atomic_t bi_cnt; /* pin count */
//private area of the owner
void *bi_private;
//method that destroys this bio
bio_destructor_t *bi_destructor; /* destructor */
}
The bio_vec structure is as follows:
struct bio_vec {
//page this bio_vec refers to
struct page *bv_page;
//length of the data area
unsigned int bv_len;
//offset within the page
unsigned int bv_offset;
}
關(guān)于bio與bio_vec的關(guān)系,用下圖表示:
Now let us consider a question:
after an I/O request is submitted to the request queue, how does the queue come to invoke the block driver's strategy routine to carry it out? And is the strategy routine invoked immediately when a request is submitted?
In fact, for efficiency, all I/O is carried out only after a certain delay: the strategy routine is not called right away. Consider a counter-example in which each I/O is executed as soon as it is submitted. Say the disk head is on track 12 and a request arrives for track 1, so the head moves to track 1. After that completes, a request arrives for track 11, so the head moves to track 11. After that, a request arrives for track 4 and the head moves there. In this example the head moves 12->1->11->4. Positioning the head is a very costly operation, so proceeding this way would clearly hurt the efficiency of the whole system. If instead we collect the I/O submitted during the delay, order it, and only then call the strategy routine, the head movement of the example becomes 12->11->4->1: the head now sweeps in a single direction.
How requests are ordered and which request is picked next is the job of the I/O scheduler, which we analyze in the section on the generic block layer.
Two operations in the kernel implement this delay: plugging the block device driver and unplugging it.
3.6: Plugging and unplugging the block device driver
The kernel interfaces for plugging and unplugging the block device driver are blk_plug_device() and blk_remove_plug(). Let us look at each in turn:
void blk_plug_device(request_queue_t *q)
{
WARN_ON(!irqs_disabled());
/*
 * don't plug a stopped queue, it must be paired with blk_start_queue()
 * which will restart the queueing
 */
//if QUEUE_FLAG_STOPPED is set, return immediately
if (test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags))
return;
//set QUEUE_FLAG_PLUGGED on the request queue
if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
//if the queue was not already QUEUE_FLAG_PLUGGED, arm the unplug timer
mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
}
int blk_remove_plug(request_queue_t *q)
{
WARN_ON(!irqs_disabled());
//clear the queue's QUEUE_FLAG_PLUGGED state
if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
//if the queue was not QUEUE_FLAG_PLUGGED, just return 0
return 0;
//if it was plugged, delete the unplug timer
del_timer(&q->unplug_timer);
return 1;
}
What happens, then, when the queue is in the QUEUE_FLAG_PLUGGED state and the timer expires?
Recall that during queue initialization blk_init_queue() calls blk_queue_make_request(). Its code reads:
void blk_queue_make_request(request_queue_t * q, make_request_fn * mfn)
{
……
……
q->unplug_delay = (3 * HZ) / 1000; /* 3 milliseconds */
if (q->unplug_delay == 0)
q->unplug_delay = 1;
INIT_WORK(&q->unplug_work, blk_unplug_work, q);
q->unplug_timer.function = blk_unplug_timeout;
q->unplug_timer.data = (unsigned long)q;
……
……
}
The timer interval is set here to (3*HZ)/1000, i.e. 3 milliseconds' worth of jiffies, clamped to at least one jiffy (with HZ=100 the expression yields 0 and is rounded up to 1). The timer's expiry handler is blk_unplug_timeout(), with the request queue itself as argument.
blk_unplug_timeout() reads:
static void blk_unplug_timeout(unsigned long data)
{
request_queue_t *q = (request_queue_t *)data;
kblockd_schedule_work(&q->unplug_work);
}
As the code shows, when the timer expires, the q->unplug_work work item is scheduled.
In blk_queue_make_request(), this work item was initialized with:
INIT_WORK(&q->unplug_work, blk_unplug_work, q)
That is, the work item's function is blk_unplug_work(), again with the request queue as argument. Its code:
static void blk_unplug_work(void *data)
{
request_queue_t *q = data;
q->unplug_fn(q);
}
At this point the queue's unplug_fn() method is invoked.
blk_init_queue() assigned this member as follows:
q->unplug_fn = generic_unplug_device;
The code behind generic_unplug_device() is:
void __generic_unplug_device(request_queue_t *q)
{
//if the queue is in the QUEUE_FLAG_STOPPED state, return
if (test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags))
return;
//blk_remove_plug() returns 1 if the queue was in the QUEUE_FLAG_PLUGGED state
if (!blk_remove_plug(q))
return;
/*
 * was plugged, fire request_fn if queue has stuff to do
 */
//if there are requests in the queue, call the queue's request_fn,
//i.e. the driver's strategy routine
if (elv_next_request(q))
q->request_fn(q);
}
blk_remove_plug() was analyzed above, so we do not repeat it here.
In the end, every I/O is completed by calling the block device driver's strategy routine.
4: The I/O scheduler layer
The I/O scheduler is described by the following structure:
struct elevator_s
{
//called when a bio is about to be inserted
elevator_merge_fn *elevator_merge_fn;
elevator_merged_fn *elevator_merged_fn;
elevator_merge_req_fn *elevator_merge_req_fn;
//fetch the next request
elevator_next_req_fn *elevator_next_req_fn;
//add a request to the request queue
elevator_add_req_fn *elevator_add_req_fn;
elevator_remove_req_fn *elevator_remove_req_fn;
elevator_requeue_req_fn *elevator_requeue_req_fn;
elevator_queue_empty_fn *elevator_queue_empty_fn;
elevator_completed_req_fn *elevator_completed_req_fn;
elevator_request_list_fn *elevator_former_req_fn;
elevator_request_list_fn *elevator_latter_req_fn;
elevator_set_req_fn *elevator_set_req_fn;
elevator_put_req_fn *elevator_put_req_fn;
elevator_may_queue_fn *elevator_may_queue_fn;
//initialization and exit operations
elevator_init_fn *elevator_init_fn;
elevator_exit_fn *elevator_exit_fn;
void *elevator_data;
struct kobject kobj;
struct kobj_type *elevator_ktype;
//name of the scheduling algorithm
const char *elevator_name;
}
We take the simplest algorithm, NOOP, as our example.
NOOP performs only simple request merging. Its definition is as follows:
elevator_t elevator_noop = {
.elevator_merge_fn = elevator_noop_merge,
.elevator_merge_req_fn = elevator_noop_merge_requests,
.elevator_next_req_fn = elevator_noop_next_request,
.elevator_add_req_fn = elevator_noop_add_request,
.elevator_name = "noop",
}
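Which elevator ends up in chosen_elevator (the one printed by blk_init_queue() above) is decided at boot. As far as I recall, 2.6 kernels accept an elevator= parameter on the kernel command line, e.g.:

elevator=noop

with which every queue initialized through blk_init_queue() is attached to the NOOP scheduler instead of the configured default.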
挨個分析里面的各項操作:
elevator_noop_merge():在請求隊列中尋找能否有可以合并的請求.代碼如下:
int elevator_noop_merge(request_queue_t *q, struct request **req,
struct bio *bio)
{
struct list_head *entry = &q->queue_head;
struct request *__rq;
int ret;
//如果請求隊列中有l(wèi)ast_merge項.則判斷l(xiāng)ast_merge項是否能夠合并
//在NOOP中一般都不會設(shè)置last_merge
if ((ret = elv_try_last_merge(q, bio))) {
*req = q->last_merge;
return ret;
}
//遍歷請求隊列中的請求
while ((entry = entry->prev) != &q->queue_head) {
__rq = list_entry_rq(entry);
if (__rq->flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER))
break;
else if (__rq->flags & REQ_STARTED)
break;
//如果不是一個fs類型的請求?
if (!blk_fs_request(__rq))
continue;
//判斷能否與這個請求合并
if ((ret = elv_try_merge(__rq, bio))) {
*req = __rq;
q->last_merge = __rq;
return ret;
}
}
return ELEVATOR_NO_MERGE;
}
elv_try_merge() decides whether a bio can be merged with a request. Its code:
inline int elv_try_merge(struct request *__rq, struct bio *bio)
{
int ret = ELEVATOR_NO_MERGE;
/*
 * we can merge and sequence is ok, check if it's possible
 */
//check whether rq and bio are requests of the same kind
if (elv_rq_merge_ok(__rq, bio)) {
//if the request's start sector + its sector count == the bio's start sector,
//the bio is appended at the back of __rq:
//return ELEVATOR_BACK_MERGE
if (__rq->sector + __rq->nr_sectors == bio->bi_sector)
ret = ELEVATOR_BACK_MERGE;
//if the request's start sector - the bio's sector count == the bio's start sector,
//the bio is added at the front of __rq:
//return ELEVATOR_FRONT_MERGE
else if (__rq->sector - bio_sectors(bio) == bio->bi_sector)
ret = ELEVATOR_FRONT_MERGE;
}
//if no merge is possible, return ELEVATOR_NO_MERGE (which is 0)
return ret;
}
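A small worked example with made-up numbers makes the two tests concrete:

/* suppose __rq->sector = 100 and __rq->nr_sectors = 8,
 * so the request covers sectors 100..107:
 *
 *   a bio with bi_sector == 108                      -> 100 + 8 == 108, ELEVATOR_BACK_MERGE
 *   a bio with bi_sector == 92 covering 8 sectors
 *   (bio_sectors(bio) == 8, i.e. sectors 92..99)     -> 100 - 8 == 92,  ELEVATOR_FRONT_MERGE
 *   a bio with bi_sector == 200                      -> neither test hits, ELEVATOR_NO_MERGE
 */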
elv_rq_merge_ok() reads:
inline int elv_rq_merge_ok(struct request *rq, struct bio *bio)
{
//check that rq may be merged at all
if (!rq_mergeable(rq))
return 0;
/*
 * different data direction or already started, don't merge
 */
//same direction of transfer?
if (bio_data_dir(bio) != rq_data_dir(rq))
return 0;
/*
 * same device and no special stuff set, merge is ok
 */
//same target object?
if (rq->rq_disk == bio->bi_bdev->bd_disk &&
!rq->waiting && !rq->special)
return 1;
return 0;
}
Note: it returns 1 if the checks pass, 0 on failure.
elevator_noop_merge_requests(): removes next from the request queue. Code:
void elevator_noop_merge_requests(request_queue_t *q, struct request *req,
struct request *next)
{
list_del_init(&next->queuelist);
}
As the code shows, under NOOP taking a request out of the queue just means unlinking its list node; no additional work is needed.
elevator_noop_next_request(): fetches the next request. Code:
struct request *elevator_noop_next_request(request_queue_t *q)
{
if (!list_empty(&q->queue_head))
return list_entry_rq(q->queue_head.next);
return NULL;
}
Very simple: take the next node of the list.
elevator_noop_add_request(): inserts a request into the request queue. Code:
void elevator_noop_add_request(request_queue_t *q, struct request *rq,
int where)
{
//by default the insertion point is the tail of the circular list
struct list_head *insert = q->queue_head.prev;
//if the request is meant to go at the front of the queue
if (where == ELEVATOR_INSERT_FRONT)
insert = &q->queue_head;
//note that regardless of 'where', the new request is added at the tail of the queue
list_add_tail(&rq->queuelist, &q->queue_head);
/*
 * new merges must not precede this barrier
 */
if (rq->flags & REQ_HARDBARRIER)
q->last_merge = NULL;
else if (!q->last_merge)
q->last_merge = rq;
}
5: Processing in the generic block layer
The entry point of the generic block layer is generic_make_request(). Its code:
void generic_make_request(struct bio *bio)
{
request_queue_t *q;
sector_t maxsector;
//nr_sectors: number of sectors to operate on
int ret, nr_sectors = bio_sectors(bio);
//may sleep
might_sleep();
/* Test device or partition size, when known. */
//size of the device, in sectors
maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
if (maxsector) {
//start sector of the bio
sector_t sector = bio->bi_sector;
//if the device has fewer sectors than requested, or the request would
//run past the end of the device, the request is invalid
if (maxsector < nr_sectors ||
maxsector - nr_sectors < sector) {
char b[BDEVNAME_SIZE];
/* This may well happen - the kernel calls
 * bread() without checking the size of the
 * device, e.g., when mounting a device. */
printk(KERN_INFO
"attempt to access beyond end of device\n");
printk(KERN_INFO "%s: rw=%ld, want=%Lu, limit=%Lu\n",
bdevname(bio->bi_bdev, b),
bio->bi_rw,
(unsigned long long) sector + nr_sectors,
(long long) maxsector);
set_bit(BIO_EOF, &bio->bi_flags);
goto end_io;
}
}
/*
 * Resolve the mapping until finished. (drivers are
 * still free to implement/resolve their own stacking
 * by explicitly returning 0)
 *
 * NOTE: we don't repeat the blk_size check for each new device.
 * Stacking drivers are expected to know what they are doing.
 */
do {
char b[BDEVNAME_SIZE];
//get the block device's request queue
q = bdev_get_queue(bio->bi_bdev);
if (!q) {
//the request queue does not exist
printk(KERN_ERR
"generic_make_request: Trying to access "
"nonexistent block-device %s (%Lu)\n",
bdevname(bio->bi_bdev, b),
(long long) bio->bi_sector);
end_io:
//this ends up calling bio->bi_end_io
bio_endio(bio, bio->bi_size, -EIO);
break;
}
//invalid: the bio is larger than the queue allows
if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) {
printk("bio too big device %s (%u > %u)\n",
bdevname(bio->bi_bdev, b),
bio_sectors(bio),
q->max_hw_sectors);
goto end_io;
}
//if the queue is marked QUEUE_FLAG_DEAD,
//bail out
if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))
goto end_io;
/*
 * If this device has partitions, remap block n
 * of partition p to block n+start(p) of the disk.
 */
//if this block device is a partition, remap the bio to the disk it belongs to
blk_partition_remap(bio);
//call the queue's make_request_fn()
ret = q->make_request_fn(q, bio);
} while (ret);
}
In blk_init_queue(), the queue's make_request_fn was set up as follows:
blk_init_queue() -> blk_queue_make_request(q, __make_request)
void blk_queue_make_request(request_queue_t * q, make_request_fn * mfn)
{
……
……
q->make_request_fn = mfn;
……
}
So the request queue's make_request_fn is set to __make_request. That function's code is:
static int __make_request(request_queue_t *q, struct bio *bio)
{
struct request *req, *freereq = NULL;
int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, err;
sector_t sector;
//start sector of the bio
sector = bio->bi_sector;
//number of sectors
nr_sectors = bio_sectors(bio);
//number of sectors in the bio's current bio_vec
cur_nr_sectors = bio_cur_sectors(bio);
//read or write
rw = bio_data_dir(bio);
/*
 * low level driver can indicate that it wants pages above a
 * certain limit bounced to low memory (ie for highmem, or even
 * ISA dma in theory)
 */
//set up a bounce buffer if needed
blk_queue_bounce(q, &bio);
spin_lock_prefetch(q->queue_lock);
barrier = bio_barrier(bio);
if (barrier && !(q->queue_flags & (1 << QUEUE_FLAG_ORDERED))) {
err = -EOPNOTSUPP;
goto end_io;
}
again:
spin_lock_irq(q->queue_lock);
//the request queue is empty
if (elv_queue_empty(q)) {
//plug the block device
blk_plug_device(q);
goto get_rq;
}
if (barrier)
goto get_rq;
//call the I/O scheduler's elevator_merge_fn method to see whether this bio
//can be merged with an existing request; if it can, req returns that request
el_ret = elv_merge(q, &req, bio);
switch (el_ret) {
//merge possible: the bio is appended at the back of req
case ELEVATOR_BACK_MERGE:
BUG_ON(!rq_mergeable(req));
if (!q->back_merge_fn(q, req, bio))
break;
req->biotail->bi_next = bio;
req->biotail = bio;
req->nr_sectors = req->hard_nr_sectors += nr_sectors;
drive_stat_acct(req, nr_sectors, 0);
if (!attempt_back_merge(q, req))
elv_merged_request(q, req);
goto out;
//merge possible: the bio is added at the front of req
case ELEVATOR_FRONT_MERGE:
BUG_ON(!rq_mergeable(req));
if (!q->front_merge_fn(q, req, bio))
break;
bio->bi_next = req->bio;
req->cbio = req->bio = bio;
req->nr_cbio_segments = bio_segments(bio);
req->nr_cbio_sectors = bio_sectors(bio);
/*
 * may not be valid. if the low level driver said
 * it didn't need a bounce buffer then it better
 * not touch req->buffer either...
 */
req->buffer = bio_data(bio);
req->current_nr_sectors = cur_nr_sectors;
req->hard_cur_sectors = cur_nr_sectors;
req->sector = req->hard_sector = sector;
req->nr_sectors = req->hard_nr_sectors += nr_sectors;
drive_stat_acct(req, nr_sectors, 0);
if (!attempt_front_merge(q, req))
elv_merged_request(q, req);
goto out;
/*
 * elevator says don't/can't merge. get new request
 */
//no merge possible: allocate a new request and add it to the queue
case ELEVATOR_NO_MERGE:
break;
default:
printk("elevator returned crap (%d)\n", el_ret);
BUG();
}
/*
 * Grab a free request from the freelist - if that is empty, check
 * if we are doing read ahead and abort instead of blocking for
 * a free slot.
 */
get_rq:
//freereq: the newly allocated request descriptor
if (freereq) {
req = freereq;
freereq = NULL;
} else {
//allocate a request descriptor
spin_unlock_irq(q->queue_lock);
if ((freereq = get_request(q, rw, GFP_ATOMIC)) == NULL) {
/*
 * READA bit set
 */
//allocation failed
err = -EWOULDBLOCK;
if (bio_rw_ahead(bio))
goto end_io;
freereq = get_request_wait(q, rw);
}
goto again;
}
req->flags |= REQ_CMD;
/*
 * inherit FAILFAST from bio (for read-ahead, and explicit FAILFAST)
 */
if (bio_rw_ahead(bio) || bio_failfast(bio))
req->flags |= REQ_FAILFAST;
/*
 * REQ_BARRIER implies no merging, but lets make it explicit
 */
if (barrier)
req->flags |= (REQ_HARDBARRIER | REQ_NOMERGE);
//initialize the newly allocated request descriptor
req->errors = 0;
req->hard_sector = req->sector = sector;
req->hard_nr_sectors = req->nr_sectors = nr_sectors;
req->current_nr_sectors = req->hard_cur_sectors = cur_nr_sectors;
req->nr_phys_segments = bio_phys_segments(q, bio);
req->nr_hw_segments = bio_hw_segments(q, bio);
req->nr_cbio_segments = bio_segments(bio);
req->nr_cbio_sectors = bio_sectors(bio);
req->buffer = bio_data(bio); /* see ->buffer comment above */
req->waiting = NULL;
//associate the bio with the request descriptor
req->cbio = req->bio = req->biotail = bio;
req->rq_disk = bio->bi_bdev->bd_disk;
req->start_time = jiffies;
//add the request descriptor to the request queue
add_request(q, req);
out:
if (freereq)
__blk_put_request(q, freereq);
//if BIO_RW_SYNC was set, call __generic_unplug_device to unplug the
//device immediately; it invokes the driver's strategy routine directly
if (bio_sync(bio))
__generic_unplug_device(q);
spin_unlock_irq(q->queue_lock);
return 0;
end_io:
bio_endio(bio, nr_sectors << 9, err);
return 0;
}
The logic of this function is fairly simple: it checks whether the bio can be merged with a request already in the queue. If so, the bio is merged into that existing request; if not, a new request descriptor is allocated and inserted into the request queue. The code above is best understood together with the NOOP algorithm analyzed earlier.
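To tie the producer side together, here is a hedged sketch of how an upper layer might hand one page to the block layer. Everything here (my_end_io, read_one_sector, the page and completion) is invented for this example; the bi_end_io convention assumed is the one of this kernel version, where the callback returns 1 while bi_size is still non-zero:

//completion callback, invoked from bio_endio()
static int my_end_io(struct bio *bio, unsigned int bytes_done, int error)
{
        if (bio->bi_size)       /* transfer not finished yet */
                return 1;
        complete((struct completion *)bio->bi_private);
        return 0;
}

static int read_one_sector(struct block_device *bdev, struct page *page)
{
        DECLARE_COMPLETION(done);
        struct bio *bio = bio_alloc(GFP_NOIO, 1);   /* room for one bio_vec */

        bio->bi_sector = 0;                         /* sector 0 of the device */
        bio->bi_bdev = bdev;
        bio->bi_end_io = my_end_io;
        bio->bi_private = &done;
        bio_add_page(bio, page, 512, 0);            /* one 512-byte segment */

        submit_bio(READ, bio);                      /* sets bi_rw, calls generic_make_request() */
        wait_for_completion(&done);
        bio_put(bio);
        return 0;
}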
Let us look more closely at how a request descriptor is allocated:
The allocation runs as follows:
if ((freereq = get_request(q, rw, GFP_ATOMIC)) == NULL) {
/*
 * READA bit set
 */
//allocation failed
err = -EWOULDBLOCK;
if (bio_rw_ahead(bio))
goto end_io;
freereq = get_request_wait(q, rw);
}
Before analyzing this code, let us discuss how request descriptors are allocated. Recall from the request queue descriptor that request_queue contains the member: struct request_list rq;
Its structure is:
struct request_list {
//allocation counts for read/write request descriptors
int count[2];
//memory pool to allocate from
mempool_t *rq_pool;
//wait queues for read/write requests, used when no memory is free
wait_queue_head_t wait[2];
};
If free memory is currently insufficient, the requesting process is put to sleep. If allocation succeeds, the request's rl field is set to point at the request_list it was allocated from.
When a request descriptor is released, it is returned to that memory pool.
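rq_pool is an ordinary mempool. As a hedged sketch of the mechanism (the names are invented here; the real pool is built during queue initialization by blk_init_free_list()), a pool that guarantees a minimum reserve of preallocated requests looks roughly like:

static kmem_cache_t *my_rq_cachep;
static mempool_t *my_rq_pool;

//a slab cache for request descriptors, plus a pool that keeps
//at least 4 of them in reserve even under memory pressure
my_rq_cachep = kmem_cache_create("my_request", sizeof(struct request),
                                 0, 0, NULL, NULL);
my_rq_pool = mempool_create(4, mempool_alloc_slab, mempool_free_slab,
                            my_rq_cachep);

//allocation may dip into the reserve; GFP_NOIO avoids recursing into I/O
struct request *rq = mempool_alloc(my_rq_pool, GFP_NOIO);
/* ... */
mempool_free(rq, my_rq_pool);   /* give it back to the pool */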
The request_list structure also serves to avoid request congestion:
Every request queue has a maximum number of requests it will handle (request_queue->nr_requests). When the requests in a queue exceed this number, the queue is flagged QUEUE_FLAG_READFULL/QUEUE_FLAG_WRITEFULL, and processes that subsequently try to add requests are put to sleep on the wait queues of the request_list. Too many sleeping processes on a queue also hurts system efficiency, so if the number of pending requests exceeds request_queue->nr_congestion_on the queue is considered congested and the kernel tries to slow down the creation of new requests; once the pending requests drop below request_queue->nr_congestion_off, the queue is considered uncongested again.
get_request()'s code is as follows:
static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
{
struct request *rq = NULL;
struct request_list *rl = &q->rq;
struct io_context *ioc = get_io_context(gfp_mask);
spin_lock_irq(q->queue_lock);
//if the request count reaches the queue's allowed maximum (q->nr_requests),
//subsequent requesting processes will be put to sleep
if (rl->count[rw]+1 >= q->nr_requests) {
/*
 * The queue will fill after this allocation, so set it as
 * full, and mark this process as "batching". This process
 * will be allowed to complete a batch of requests, others
 * will be blocked.
 */
//check whether the queue is already flagged QUEUE_FLAG_READFULL/QUEUE_FLAG_WRITEFULL;
//if not, set the flag and mark the current process as batching
if (!blk_queue_full(q, rw)) {
ioc_set_batching(ioc);
blk_set_queue_full(q, rw);
}
}
//if the queue is full, the process is not batching, and the I/O scheduler
//will not exempt it, do not allocate; return directly
if (blk_queue_full(q, rw)
&& !ioc_batching(ioc) && !elv_may_queue(q, rw)) {
/*
 * The queue is full and the allocating process is not a
 * "batcher", and not exempted by the IO scheduler
 */
spin_unlock_irq(q->queue_lock);
goto out;
}
//we are about to allocate a request descriptor: bump the count
rl->count[rw]++;
//if the number of pending requests now exceeds request_queue->nr_congestion_on,
//the queue is congested; set the congestion flag
if (rl->count[rw] >= queue_congestion_on_threshold(q))
set_queue_congested(q, rw);
spin_unlock_irq(q->queue_lock);
//allocate the request descriptor
rq = blk_alloc_request(q, gfp_mask);
if (!rq) {
/*
 * Allocation failed presumably due to memory. Undo anything
 * we might have messed up.
 *
 * Allocating task should really be put onto the front of the
 * wait queue, but this is pretty rare.
 */
spin_lock_irq(q->queue_lock);
//the allocation failed, so undo the count taken above
freed_request(q, rw);
spin_unlock_irq(q->queue_lock);
goto out;
}
if (ioc_batching(ioc))
ioc->nr_batch_requests--;
//initialize the fields of the request
INIT_LIST_HEAD(&rq->queuelist);
/*
 * first three bits are identical in rq->flags and bio->bi_rw,
 * see bio.h and blkdev.h
 */
rq->flags = rw;
rq->errors = 0;
rq->rq_status = RQ_ACTIVE;
rq->bio = rq->biotail = NULL;
rq->buffer = NULL;
rq->ref_count = 1;
rq->q = q;
rq->rl = rl;
rq->waiting = NULL;
rq->special = NULL;
rq->data_len = 0;
rq->data = NULL;
rq->sense = NULL;
out:
//drop the reference on ioc
put_io_context(ioc);
return rq;
}
由于在分配之前遞增了統(tǒng)計計數(shù),所以在分配失敗后,要把這個統(tǒng)計計數(shù)減下來,這是由freed_request()完成的.它的代碼如下:
static void freed_request(request_queue_t *q, int rw)
{
struct request_list *rl = &q->rq;
rl->count[rw]--;
//如果分配計數(shù)小于request_queue->nr_congestion_off.隊列已經(jīng)不擁塞了
if (rl->count[rw] < queue_congestion_off_threshold(q))
clear_queue_congested(q, rw);
//如果計數(shù)小于允許的最大值.那可以分配請求了,將睡眠的進程喚醒
if (rl->count[rw]+1 <= q->nr_requests) {
//喚醒等待進程
if (waitqueue_active(&rl->wait[rw]))
wake_up(&rl->wait[rw]);
//清除QUEUE_FLAG_READFULL/QUEUE_FLAG_WRITEFULL
blk_clear_queue_full(q, rw);
}
}
在這里我們可以看到,如果待處理請求小于請求隊列所允許的最大值,就會將睡眠的進程喚醒.
What happens if allocation of the request descriptor fails? Back in __make_request():
if ((freereq = get_request(q, rw, GFP_ATOMIC)) == NULL) {
/*
 * READA bit set
 */
//allocation failed
err = -EWOULDBLOCK;
//if this operation is a read-ahead that must not block, just end the bio
if (bio_rw_ahead(bio))
goto end_io;
//suspend the process
freereq = get_request_wait(q, rw);
}
If the allocation fails, get_request_wait() is called to suspend the process. Its code:
static struct request *get_request_wait(request_queue_t *q, int rw)
{
//define a wait queue entry
DEFINE_WAIT(wait);
struct request *rq;
struct io_context *ioc;
//unplug the block device; this directly invokes the driver's strategy routine
generic_unplug_device(q);
ioc = get_io_context(GFP_NOIO);
do {
struct request_list *rl = &q->rq;
//add the current process to the wait queue, with state TASK_UNINTERRUPTIBLE
prepare_to_wait_exclusive(&rl->wait[rw], &wait,
TASK_UNINTERRUPTIBLE);
//try to allocate the request again
rq = get_request(q, rw, GFP_NOIO);
if (!rq) {
//if it failed again, sleep
io_schedule();
/*
 * After sleeping, we become a "batching" process and
 * will be able to allocate at least one request, and
 * up to a big batch of them for a small period time.
 * See ioc_batching, ioc_set_batching
 */
//we reach this point after being woken
ioc_set_batching(ioc);
}
//remove the process from the wait queue
finish_wait(&rl->wait[rw], &wait);
} while (!rq);
put_io_context(ioc);
return rq;
}
This code is fairly straightforward, and we have analyzed similar code many times before, so we do not dwell on it.
One more point in __make_request() deserves attention. The memory behind a bio may be in high memory, which the kernel cannot address directly; bio_vecs that live in high memory must therefore be handled by temporarily mapping them and copying the data into low memory. This is the bounce buffer mentioned earlier. The work is done in blk_queue_bounce(), which is fairly simple and is left for the reader to analyze.
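Whether bouncing happens at all is governed by the queue's bounce_pfn field, which drivers set through blk_queue_bounce_limit(). A one-line sketch: a driver whose hardware cannot DMA to high memory would declare, at init time,

blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);   /* bounce any page above low memory */

after which blk_queue_bounce() copies any highmem bio_vec into a low-memory page before the request reaches the hardware.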
到這里,通用塊層的處理分析就結(jié)束了.我們繼續(xù)分析其它的層次. |
|