《Troubleshooting SQL Server》讀書筆記

看見就非常 2014-07-14

展開全文

第一章 A Performance Troubleshooting Methodology,。

確定具體的方法論用于SQL Server故障診斷是件很難的事情,。因?yàn)楦鶕?jù)問題和環(huán)境的不同,，解決方法也不一樣,。一個準(zhǔn)確的故障診斷,，不僅是收集各種信息,，而且要明白它們的含義,。千萬不要只見樹木不見森林,。

作者通過一個簡單的診斷示例,，演示基本的方法和借助的工具：

1. 首先查看DMV sys.dm_os_wait_stats來確定操作系統(tǒng)層面的主要的資源等待,。

2. 假設(shè)發(fā)現(xiàn)主要等待類型為PAGEIOLATCH_SH。PAGEIOLATCH_SH發(fā)生的原因是磁盤子系統(tǒng)的速度不能滿足DB對頁的請求速度,。

看起來是IO瓶頸,。但是我們要考慮周全，也有可能是其它的問題導(dǎo)致的過度IO,。而這是表現(xiàn)不是根本原因,。

3. 然后通過sys.dm_io_virtual_file_stats得到各個數(shù)據(jù)庫和數(shù)據(jù)文件的IO量和IO延時。同時查看性能計(jì)數(shù)器Physical Disk\Avg. Disk Reads/sec和Physical Disk\Avg. Disk Writes/sec,。

這樣就能把問題定位到具體數(shù)據(jù)庫,。

4. 針對上面確定的數(shù)據(jù)庫，通過sys.dm_exec_query_stats查看是哪些查詢造成了大量的IO.找出這些語句調(diào)整或者采取其它的調(diào)優(yōu)方式,。

5. 同時通過觀察性能計(jì)數(shù)器Page Life Expectancy,，F(xiàn)ree List Stalls/sec和high Lazy Writes/sec,如果PLE值很低，而后兩個計(jì)數(shù)器常不為0,，則可認(rèn)為是內(nèi)存瓶頸,。

在這個故障診斷的案例中有許多要關(guān)注的點(diǎn)，一不小心就會被帶坑里去,，采用了錯誤的調(diào)優(yōu)方式,。比如，從一始我們就認(rèn)定是IO瓶頸,，于是提高IO子系統(tǒng)的速度,。雖然短時會改善性能，但是過段時間這個問題又會出現(xiàn),。

所以千萬不要只見樹木不見森林,。

等待統(tǒng)計(jì)(Wait Statistics: the Basis for Troubleshooting):

在故障診斷的第一個檢查的項(xiàng)一般會是等待統(tǒng)計(jì)。SQL Server有一個偽操作系統(tǒng)SQLOS,，管理著內(nèi)部的各種操作和資源,。而SQLOS中的操作和資源的各種等待，可以通過sys.dm_os_wait_stats查詢得到,。

因?yàn)楹芏鄷r候?qū)栴}現(xiàn)狀的描述不是很清楚,，比如“數(shù)據(jù)庫好慢”,，所以我們搞明白SQL Server在待什么和為什么等待,。

下面的查詢已經(jīng)過濾掉非問題型等待：

SELECT TOP  10  
        wait_type , 
        max_wait_time_ms wait_time_ms , 
        signal_wait_time_ms , 
        wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms , 
        100.0 * wait_time_ms / SUM (wait_time_ms ) OVER  ( ) 
                                    AS percent_total_waits , 
        100.0 * signal_wait_time_ms / SUM (signal_wait_time_ms) OVER  ( ) 
                                    AS percent_total_signal_waits , 
        100.0 * ( wait_time_ms - signal_wait_time_ms ) 
         / SUM (wait_time_ms ) OVER  ( ) AS percent_total_resource_waits  
FROM    sys .dm_os_wait_stats 
WHERE    wait_time_ms > 0  -- remove zero wait_time  
         AND  wait_type NOT  IN -- filter out additional irrelevant waits  
( 'SLEEP_TASK' , 'BROKER_TASK_STOP', 'BROKER_TO_FLUSH' , 
  'SQLTRACE_BUFFER_FLUSH','CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT',  
  'LAZYWRITER_SLEEP', 'SLEEP_SYSTEMTASK', 'SLEEP_BPOOL_FLUSH', 
  'BROKER_EVENTHANDLER', 'XE_DISPATCHER_WAIT' , 'FT_IFTSHC_MUTEX' , 
  'CHECKPOINT_QUEUE', 'FT_IFTS_SCHEDULER_IDLE_WAIT' ,  
  'BROKER_TRANSMITTER' , 'FT_IFTSHC_MUTEX' , 'KSOURCE_WAKEUP', 
  'LAZYWRITER_SLEEP', 'LOGMGR_QUEUE', 'ONDEMAND_TASK_QUEUE', 
  'REQUEST_FOR_DEADLOCK_SEARCH' , 'XE_TIMER_EVENT', 'BAD_PAGE_PROCESS', 
  'DBMIRROR_EVENTS_QUEUE', 'BROKER_RECEIVE_WAITFOR' , 
  'PREEMPTIVE_OS_GETPROCADDRESS', 'PREEMPTIVE_OS_AUTHENTICATIONOPS', 
  'WAITFOR', 'DISPATCHER_QUEUE_SEMAPHORE', 'XE_DISPATCHER_JOIN' , 
  'RESOURCE_QUEUE' ) 
ORDER  BY wait_time_ms DESC

一些常見的等待類型有：

  CXPACKET:并行執(zhí)行等待

  SOS_SCHEDULER_YIELD：每個worker在占用scheduler一會兒然后退讓(Yield)，讓其它的worker使用scheduler,。此等待通常說明有CPU壓力,。

  THREADPOOL:每個task需要綁定到一個worker執(zhí)行。此等待說明worker用盡,，有CPU壓力,，或者阻塞導(dǎo)致大量的task長時間占用woker,。

  LCK_*：鎖等待。

  PAGEIOLATCH_*,  IO_COMPLETION,  WRITELOG：IO瓶頸的表現(xiàn),。

  PAGELATCH_*：頁爭用等待,。此等待最出名代表就是tempdb的系統(tǒng)頁爭用問題。

  LATCH_*:是一種輕量短暫同步對象,，用于保護(hù)訪問的內(nèi)部對象,。根據(jù)等待類型不同，代表著不同的問題,。sys.dm_os_latch_stats

  ASYNC_NETWORK_IO：通常是客戶端處理數(shù)據(jù)性能不夠快,。

根據(jù)不同等待類型確定，來確定我們下一步診斷方向,。如果PAGEIOLATCH_*等待過高,，接下來就會檢查虛擬文件狀態(tài)，如果LCK_*等待過高,，就會檢查數(shù)據(jù)庫中的阻塞狀況,，而不會浪費(fèi)時間去檢查IO配置。

在解決完一個問題后需要清空等待信息,，然后等過段時間DMV中的數(shù)據(jù)積累到一定程度,，再來確定問題是否還存在。如下語句清空：

DBCC SQLPERF ('sys.dm_os_wait_stats' , clear )

虛擬文件統(tǒng)計(jì)(Virtual File Statistics):

只看等待統(tǒng)計(jì),，很多問題都像是IO瓶頸,。所以我們要檢查虛擬文件統(tǒng)計(jì)。

SELECT  DB_NAME(vfs .database_id) AS database_name  , 
        vfs.database_id  , 
        vfs.FILE_ID , 
        io_stall_read_ms / NULLIF(num_of_reads , 0) AS avg_read_latency , 
        io_stall_write_ms / NULLIF(num_of_writes, 0) 
                                               AS avg_write_latency , 
        io_stall  / NULLIF(num_of_reads + num_of_writes, 0) 
                                               AS avg_total_latency , 
        num_of_bytes_read / NULLIF(num_of_reads , 0) 
                                               AS avg_bytes_per_read  , 
        num_of_bytes_written  / NULLIF(num_of_writes, 0) 
                                               AS avg_bytes_per_write , 
        vfs.io_stall , 
        vfs.num_of_reads , 
        vfs.num_of_bytes_read , 
        vfs.io_stall_read_ms  , 
        vfs.num_of_writes , 
        vfs.num_of_bytes_written , 
        vfs.io_stall_write_ms , 
        size_on_disk_bytes  / 1024 / 1024. AS size_on_disk_mbytes , 
        physical_name 
FROM    sys .dm_io_virtual_file_stats (NULL, NULL)  AS vfs 
         JOIN sys .master_files  AS mf ON vfs.database_id  = mf .database_id 
                                        AND  vfs.FILE_ID = mf .FILE_ID 
ORDER  BY avg_total_latency DESC

發(fā)現(xiàn)有高IO等待意味著有IO瓶頸,，但是其根本原因可能是磁盤了系統(tǒng)有瓶頸,，也可能是索引過多、索引缺失,、內(nèi)存瓶頸和低效的查詢等,。還要借助其它方式進(jìn)一步確定根本原因。千萬不要只見樹木不見森林,。

性能計(jì)數(shù)器(Performance Counters)

在診斷初期,，會先收集針對SQL Server的性能計(jì)數(shù)器，隨著深入再收集Windows的計(jì)數(shù)器,。這些數(shù)據(jù)有助于明白當(dāng)前的性能狀況和界定問題的邊界,。

推薦的分析工具PAL?？梢允褂肞erfMon收集,，也可以通過 sys.dm_os_performance_counters查看。

使用DMV時，要注意計(jì)數(shù)器數(shù)據(jù)是增量的,，所以要收集兩個時間點(diǎn)上的數(shù)據(jù)副本進(jìn)行對比,。

DECLARE @CounterPrefix NVARCHAR (30) 
SET  @CounterPrefix = CASE WHEN @@SERVICENAME = 'MSSQLSERVER' 
                             THEN 'SQLServer:'  
                          ELSE 'MSSQL$'  + @@SERVICENAME + ':'  
                     END  ; 
-- Capture the first counter set  
SELECT  CAST(1 AS INT ) AS collection_instance , 
        [OBJECT_NAME]  , 
        counter_name , 
        instance_name  , 
        cntr_value , 
        cntr_type , 
         CURRENT_TIMESTAMP  AS collection_time 
INTO    #perf_counters_init 
FROM    sys .dm_os_performance_counters 
WHERE    ( OBJECT_NAME = @CounterPrefix + 'Access Methods' 
           AND  counter_name = 'Full Scans/sec' 
         ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'Access Methods' 
             AND  counter_name = 'Index Searches/sec'  
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'Buffer Manager' 
             AND  counter_name = 'Lazy Writes/sec'  
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'Buffer Manager' 
             AND  counter_name = 'Page life expectancy'  
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'General Statistics'  
             AND  counter_name = 'Processes Blocked' 
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'General Statistics'  
             AND  counter_name = 'User Connections' 
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'Locks' 
             AND  counter_name = 'Lock Waits/sec' 
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'Locks' 
             AND  counter_name = 'Lock Wait Time (ms)' 
           ) 

         OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics' 
             AND  counter_name = 'SQL Re-Compilations/sec' 
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'Memory Manager' 
             AND  counter_name = 'Memory Grants Pending' 
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics' 
             AND  counter_name = 'Batch Requests/sec'  
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics' 
             AND  counter_name = 'SQL Compilations/sec'  
           ) 
 
-- Wait on Second between data collection  
WAITFOR DELAY  '00:00:01'  
 
-- Capture the second counter set 
SELECT  CAST(2 AS INT ) AS collection_instance , 
         OBJECT_NAME , 
        counter_name , 
        instance_name  , 
        cntr_value , 
        cntr_type , 
         CURRENT_TIMESTAMP  AS collection_time 
INTO    #perf_counters_second 
FROM    sys .dm_os_performance_counters 
WHERE    ( OBJECT_NAME = @CounterPrefix + 'Access Methods' 
           AND  counter_name = 'Full Scans/sec' 
         ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'Access Methods' 
             AND  counter_name = 'Index Searches/sec'  
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'Buffer Manager' 
             AND  counter_name = 'Lazy Writes/sec'  
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'Buffer Manager' 
             AND  counter_name = 'Page life expectancy'  
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'General Statistics'  
             AND  counter_name = 'Processes Blocked' 
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'General Statistics'  
             AND  counter_name = 'User Connections' 
           ) 

         OR ( OBJECT_NAME = @CounterPrefix + 'Locks' 
             AND  counter_name = 'Lock Waits/sec' 
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'Locks' 
             AND  counter_name = 'Lock Wait Time (ms)' 
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics' 
             AND  counter_name = 'SQL Re-Compilations/sec' 
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'Memory Manager' 
             AND  counter_name = 'Memory Grants Pending' 
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics' 
             AND  counter_name = 'Batch Requests/sec'  
           ) 
         OR ( OBJECT_NAME = @CounterPrefix + 'SQL Statistics' 
             AND  counter_name = 'SQL Compilations/sec'  
           ) 
 
-- Calculate the cumulative counter values 
SELECT  i .OBJECT_NAME , 
        i.counter_name , 
        i.instance_name , 
         CASE WHEN i.cntr_type  = 272696576  
                THEN s.cntr_value - i.cntr_value  
             WHEN i.cntr_type  = 65792 THEN s.cntr_value  
         END  AS cntr_value 
FROM    #perf_counters_init  AS i 
         JOIN #perf_counters_second AS s 
              ON i.collection_instance  + 1  = s.collection_instance 
                  AND  i.OBJECT_NAME = s.OBJECT_NAME 
                  AND  i.counter_name = s.counter_name  
                  AND  i.instance_name = s.instance_name 
ORDER  BY OBJECT_NAME 
 
-- Cleanup tables  
DROP TABLE  #perf_counters_init  
DROP TABLE  #perf_counters_second

收集的計(jì)數(shù)器包括：

·     SQLServer:Access Methods\Full Scans/sec
·     SQLServer:Access Methods\Index Searches/sec
·     SQLServer:Buffer Manager\Lazy Writes/sec
·     SQLServer:Buffer Manager\Page life expectancy
·     SQLServer:Buffer Manager\Free list stalls/sec
·     SQLServer:General Statistics\Processes Blocked
·     SQLServer:General Statistics\User Connections
·     SQLServer:Locks\Lock Waits/sec
·     SQLServer:Locks\Lock Wait Time (ms)
·     SQLServer:Memory Manager\Memory Grants Pending
·     SQLServer:SQL Statistics\Batch Requests/sec
·     SQLServer:SQL Statistics\SQL Compilations/sec
·     SQLServer:SQL Statistics\SQL Re-Compilations/sec

Index Searches比Full Scans高個800~1000比較理想。對于大于4Gb RAM的機(jī)器,，Page life expectancy=(RAM/4)*300.

計(jì)劃高速緩存的使用(Plan Cache Usage)

在SQL Server2005&2008中可以使用sys.dm_exec_query_stats查詢得到各種存于計(jì)劃高速緩存中的執(zhí)行信息,。結(jié)合其它的DMV&DMF可以得到各種TOP(N)信息。如：top 10 Physical Reads

SELECT TOP  10  
        execution_count  , 
        statement_start_offset  AS stmt_start_offset , 
         sql_handle  , 
        plan_handle  , 
        total_logical_reads / execution_count  AS avg_logical_reads , 
        total_logical_writes  / execution_count  AS avg_logical_writes  , 
        total_physical_reads  / execution_count  AS avg_physical_reads  , 
        t.text 
FROM    sys .dm_exec_query_stats AS s 
         CROSS  APPLY  sys .dm_exec_sql_text(s.sql_handle ) AS t 
ORDER  BY avg_physical_reads  DESC

總結(jié)

性能診斷最忌片面,，千萬不要只見樹林不見森林,。這不僅是一門手藝，更是一門藝術(shù),。

本站是提供個人知識管理的網(wǎng)絡(luò)存儲空間,，所有內(nèi)容均由用戶發(fā)布，不代表本站觀點(diǎn),。請注意甄別內(nèi)容中的聯(lián)系方式,、誘導(dǎo)購買等信息，謹(jǐn)防詐騙,。如發(fā)現(xiàn)有害或侵權(quán)內(nèi)容,，請點(diǎn)擊一鍵舉報(bào)。

轉(zhuǎn)藏 分享

QQ空間 QQ好友新浪微博微信

獻(xiàn)花（0） +1

來自：看見就非常 > 《server》

舉報(bào)/認(rèn)領(lǐng)