楊廷琨(yangtingkun)
云和恩墨 CTO 高級(jí)咨詢顧問,,Oracle ACE 總監(jiān),ITPUB Oracle 數(shù)據(jù)庫管理版版主 編輯手記:在很多數(shù)據(jù)庫的故障案例中,,一個(gè)簡(jiǎn)單的疏忽可能導(dǎo)致問題被層層放大,,最終導(dǎo)致故障,這就是蝴蝶效應(yīng)的傳播原理,。這里分享的小案例自頂向下的追溯可以顯見:實(shí)例掛起->歸檔失敗->實(shí)例錯(cuò)誤->參數(shù)配置,。根本的原因往往很簡(jiǎn)單,DBA的嚴(yán)謹(jǐn)尤其重要,。 客戶的11.2.0.3 RAC數(shù)據(jù)庫出現(xiàn)了歸檔失敗的情況,,導(dǎo)致單個(gè)實(shí)例出現(xiàn)HANG死的狀況,。 檢查錯(cuò)誤信息發(fā)現(xiàn): Tue Jul 02 16:49:13 2013 ARC1: Error 19504 Creating archive log file TO '+DATA02' ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance orcl1 - Archival Error ORA-16038: log 14 SEQUENCE# 68244 cannot be archived ORA-19504: failed TO CREATE file "" ORA-00312: online log 14 thread 1: '+DATA02/orcl/onlinelog/group_14.264.792274883' ORA-00312: online log 14 thread 1: '+DATA02/orcl/onlinelog/group_14.265.792274889' Archiver process freed FROM errors. No longer stopped Tue Jul 02 16:50:37 2013 ARC0: LGWR IS actively archiving destination LOG_ARCHIVE_DEST_3 ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance orcl1 - Archival Error ORA-16014: log 14 SEQUENCE# 68244 NOT archived, no available destinations ORA-00312: online log 14 thread 1: '+DATA02/orcl/onlinelog/group_14.264.792274883' ORA-00312: online log 14 thread 1: '+DATA02/orcl/onlinelog/group_14.265.792274889' ARC0: Archive log rejected (thread 1 SEQUENCE 68240) at host 'orclsh' FAL[server, ARC0]: FAL archive failed, see trace file. ARCH: FAL archive failed. Archiver continuing ORACLE Instance orcl1 - Archival Error. Archiver continuing.
由于歸檔失敗發(fā)生在ASM磁盤上,首先檢查ASM磁盤空間以及DB_RECOVERY_FILE_DEST_SIZE,,ASM磁盤空間是足夠的,,而且由于只有一個(gè)節(jié)點(diǎn)出現(xiàn)出現(xiàn)了無法歸檔的問題,也可以排除是空間不足造成的,。 確認(rèn)兩個(gè)節(jié)點(diǎn)的DB_RECOVERY_FILE_DEST_SIZE參數(shù)設(shè)置都是0,,基本上可以判斷問題和當(dāng)前節(jié)點(diǎn)的ASM實(shí)例狀態(tài)不正常有關(guān)。 接下來檢查ASM實(shí)例的錯(cuò)誤信息: Tue Jul 02 16:41:43 2013 Dumping diagnostic DATA IN directory=[cdmp_20130702164115], requested BY (instance=2, osid=2032294 (LMD0)), summary=[incident=165521]. Tue Jul 02 16:49:13 2013 Dumping diagnostic DATA IN directory=[cdmp_20130702164845], requested BY (instance=2, osid=2032294 (LMD0)), summary=[incident=165522].
當(dāng)前節(jié)點(diǎn)ASM實(shí)例出現(xiàn)了的這個(gè)信息,,說明報(bào)錯(cuò)發(fā)生在實(shí)例2上,,檢查實(shí)例2的ASM告警日志文件: Tue Jul 02 18:29:55 2013 Errors IN file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_lmd0_2032294.trc (incident=186256): ORA-04031: unable TO allocate 3768 bytes OF shared memory ("shared pool","unknown object","sga heap(1,0)","ges enqueues") USE ADRCI OR Support Workbench TO package the incident. See Note 411.1 at My Oracle Support FOR error AND packaging details. Insufficient shared pool TO allocate a GES object (ospid 2032294) Tue Jul 02 18:29:55 2013 Sweep [inc][186256]: completed Tue Jul 02 18:36:49 2013 Errors IN file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_lmd0_2032294.trc (incident=186257): ORA-04031: unable TO allocate 3768 bytes OF shared memory ("shared pool","unknown object","sga heap(1,0)","ges enqueues") USE ADRCI OR Support Workbench TO package the incident. See Note 411.1 at My Oracle Support FOR error AND packaging details. Insufficient shared pool TO allocate a GES object (ospid 2032294)
果然實(shí)例2上的ASM出現(xiàn)了大量ORA-4031錯(cuò)誤,這說明ASM實(shí)例的Shared Pool內(nèi)存可能配置不足,。檢查ASM啟動(dòng)的參數(shù)配置: Sat Aug 25 20:06:55 2012 NOTE: No asm libraries found IN the system ERROR: -5(Duplicate disk DATA_DG01:ASM_DISK1) ERROR: -5(Duplicate disk DATA_DG01:ASM_DISK2) MEMORY_TARGET defaulting TO 411041792. * instance_number obtained FROM CSS = 2, checking FOR the existence OF node 0... * node 0 does NOT exist. instance_number = 2 Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Private Interface 'en1' configured FROM GPnP FOR USE AS a private interconnect. [name='en1', TYPE=1, ip=169.254.78.6, mac=00-1a-64-bb-50-7d, net=169.254.0.0/16, mask=255.255.0.0, USE=haip:cluster_interconnect/62] Public Interface 'en0' configured FROM GPnP FOR USE AS a public interface. [name='en0', TYPE=1, ip=10.1.16.35, mac=00-1a-64-bb-50-7c, net=10.1.16.32/27, mask=255.255.255.224, USE=public/1] Picked latch-free SCN scheme 3 USING LOG_ARCHIVE_DEST_1 parameter DEFAULT VALUE AS /u01/app/11.2.0.3/grid/dbs/arch Autotune OF undo retention IS turned ON. LICENSE_MAX_USERS = 0 SYS auditing IS disabled NOTE: Volume support enabled Starting up: Oracle DATABASE 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production WITH the REAL Application Clusters AND Automatic Storage Management options. ORACLE_HOME = /u01/app/11.2.0.3/grid System name: AIX Node name: orcldb2 Release: 1 Version: 6 Machine: 00C94E064C00 USING parameter settings IN server-side pfile /u01/app/11.2.0.3/grid/dbs/init+ASM2.ora System parameters WITH non-DEFAULT VALUES: large_pool_size = 12M instance_type = "asm" remote_login_passwordfile= "EXCLUSIVE" asm_diskstring = "/dev/ocr_*" asm_diskstring = "/dev/voting_*" asm_diskstring = "/dev/asm_*" asm_diskgroups = "DATA" asm_diskgroups = "DATA_DG01" asm_diskgroups = "SPFILE_DG" asm_power_limit = 1 diagnostic_dest = "/u01/app/grid" Cluster communication IS configured TO USE the following interface(s) FOR this instance 169.254.78.6 cluster interconnect IPC version:Oracle UDP/IP (generic) IPC Vendor 1 proto 2
當(dāng)前ASM實(shí)例使用默認(rèn)的MEMORY_TARGET配置,,分配大小大約是400M,根據(jù)Oracle的MOS文章:ASM & Shared Pool (ORA-4031) [ID 437924.1],,在11.2.0.3中,,Oracle增加了ASM實(shí)例所允許的默認(rèn)進(jìn)程數(shù)PROCESSES,但是默認(rèn)的MEMORY_TARGET參數(shù)沒有增加,。 根據(jù)Oracle的建議,,11.2.0.3的MEMORY_TARGET至少應(yīng)該設(shè)置到1536M,而MEMORY_MAX_TARGET設(shè)置為4096M,。 SQL> ALTER system SET memory_max_target=4096m scope=spfile; SQL> ALTER system SET memory_target=1536m scope=spfile;
對(duì)于當(dāng)前的情況,,如果短時(shí)間內(nèi)無法重啟DB和ASM實(shí)例,可以在問題節(jié)點(diǎn)配置一個(gè)第二本地歸檔路徑,,設(shè)置目標(biāo)路徑為本地磁盤,,從而避免歸檔無法完成而導(dǎo)致的實(shí)例HANG死。
|