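ORA-600 (kfnsBackground03) Error
A customer's database hit an ORA-600 (kfnsBackground03) error.
The database is 10.2.0.3 RAC on HP-UX 11.23. The error can show up in both the ASM instance and the database instance. When it occurs in the ASM instance, the ASM instance does not crash; when it occurs in the database instance, the database instance is forcibly shut down: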
Tue May 15 10:28:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:32:50 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:33:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:34:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:43:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:46:13 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_18846.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Tue May 15 10:46:14 2012
Trace dumping is performing id=[cdmp_20120515104614]
The above is from the ASM instance's alert log. Below are the errors reported by the database instance during the same period:
Tue May 15 10:38:12 2012
kkjcre1p: unable to spawn jobq slave process
Tue May 15 10:38:12 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_cjq0_17957.trc:
Tue May 15 10:42:19 2012
PMON failed to acquire latch, see PMON dump
Tue May 15 10:43:04 2012
found dead shared server 'S006', pid = (90, 4)
Tue May 15 10:43:10 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j000_19938.trc:
ORA-12012: error on auto execute of job 42579
ORA-27468: "EXFSYS.RLM$EVTCLEANUP" is locked by another process
Tue May 15 10:45:06 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j002_23628.trc:
ORA-12012: error on auto execute of job 8888975
ORA-27468: "ORCL.P_DATA_C" is locked by another process
Tue May 15 10:45:10 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j003_23959.trc:
ORA-12012: error on auto execute of job 8855572
ORA-27468: "ORCL.P_DATA" is locked by another process
Tue May 15 10:46:14 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_asmb_18844.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Tue May 15 10:46:14 2012
ASMB: terminating instance due to error 15064
Tue May 15 10:46:15 2012
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/ORCL/bdump/orcl1_diag_17903.trc
Tue May 15 10:46:16 2012
Shutting down instance (abort)
License high water mark = 52
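Judging from this instance crash alone, the problem appears to be related to resource exhaustion on the host. Before the failure, the database instance had already reported kkjcre1p: unable to spawn jobq slave process and PMON failed to acquire latch.
However, when the same error occurred at other times, there was no clear sign of a resource shortage: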
Sat May 26 09:47:49 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:49:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:52:23 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_21722.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Sat May 26 09:52:25 2012
Trace dumping is performing id=[cdmp_20120526095225]
The database alert log for the same moment shows:
Sat May 26 09:52:24 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_asmb_21720.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Sat May 26 09:52:24 2012
ASMB: terminating instance due to error 15064
Sat May 26 09:52:25 2012
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/ORCL/bdump/orcl1_diag_20837.trc
Sat May 26 09:52:26 2012
Shutting down instance (abort)
License high water mark = 46
Sat May 26 09:52:30 2012
Instance terminated by ASMB, pid = 21720
Sat May 26 09:52:31 2012
Instance terminated by USER, pid = 536
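This time the error appeared without any accompanying messages, and the database instance simply went down. However, every time the error occurs, the ASM instance's alert log contains the message: NOTE: database ORCL1:ORCL failed during msg 19, reply 2. This indicates a communication problem between the ASM instance and the database. kfnsBackground stands for Kernel Files Network Service Background. MSG 19 refers to IOSTAT and reply 2 refers to TIMEOUT, which suggests that ASM hit a timeout while performing an I/O operation; this caused the ASM error and, in turn, the instance crash.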
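The pattern described here (repeated "msg 19, reply 2" notes followed by the ORA-600) can be checked mechanically against an ASM alert log. The following is only a minimal sketch, assuming the 10g alert log layout seen in the excerpts above, where a timestamp line precedes each group of messages. The function names and the 30-minute window are illustrative choices, not anything taken from Oracle.

```python
import re
from datetime import datetime, timedelta

# Timestamp layout as it appears in the alert log excerpts, e.g.
# "Sat May 26 09:47:49 2012" (assumes English day and month names).
TS_FORMAT = "%a %b %d %H:%M:%S %Y"
TIMEOUT_NOTE = re.compile(r"failed during msg 19, reply 2")
ORA600 = re.compile(r"ORA-00600:.*\[kfnsBackground03\]")


def parse_events(lines):
    """Yield (timestamp, kind) for timeout notes and kfnsBackground03 errors."""
    current = None
    for raw in lines:
        line = raw.strip()
        try:
            # A line that parses as a timestamp dates the messages after it.
            current = datetime.strptime(line, TS_FORMAT)
            continue
        except ValueError:
            pass
        if current is None:
            continue
        if TIMEOUT_NOTE.search(line):
            yield current, "timeout"
        elif ORA600.search(line):
            yield current, "ora600"


def timeouts_before_errors(lines, window=timedelta(minutes=30)):
    """For each ORA-600, count the timeout notes logged within `window` before it."""
    timeouts = []
    report = []
    for ts, kind in parse_events(lines):
        if kind == "timeout":
            timeouts.append(ts)
        else:
            count = sum(1 for t in timeouts if timedelta(0) <= ts - t <= window)
            report.append((ts, count))
    return report


# Sample taken from the ASM alert log excerpt of May 26.
SAMPLE = """\
Sat May 26 09:47:49 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:49:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:52:23 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_21722.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
""".splitlines()

for when, count in timeouts_before_errors(SAMPLE):
    print(when, "preceded by", count, "timeout note(s)")
```

Running the same function over a whole alert log gives a quick view of whether every kfnsBackground03 occurrence was preceded by timeout notes, and how long the warning period was.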
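This error is relatively rare. In all of METALINK only three articles relate to it. Two of them describe an archive destination running out of space, which hung the system, eventually led to I/O timeouts, and raised the error; the third gives no further information. The versions involved in these three cases are 10.2.0.4 for AIX, 10.2.0.4 for Solaris, and 10.2.0.3 for HP-UX, which suggests the error is not tied to a platform but is concentrated in versions 10.2.0.3 and 10.2.0.4.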
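Based on the analysis above, an operating system monitoring tool should be deployed so that system resource usage can be observed at any time and used to support the analysis when a similar error occurs. Since there is no record of this problem occurring in 10.2.0.5, upgrading the database to 10.2.0.5 may avoid it.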
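As a rough illustration of what such monitoring collects, the sketch below runs a set of OS commands at a fixed interval and writes their output with a timestamp, so it can later be lined up against the alert log. It is only a sketch, not a replacement for a dedicated monitoring tool. The function names are hypothetical, and the commands to run (for example vmstat) are an assumption that depends on what the platform provides.

```python
import subprocess
import time
from datetime import datetime


def take_sample(command):
    """Run one OS command and return its output prefixed with a timestamp line."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=60)
    # Same timestamp layout as the alert log, to make correlation easier.
    stamp = datetime.now().strftime("%a %b %d %H:%M:%S %Y")
    return f"*** {stamp} {' '.join(command)}\n{result.stdout}"


def run_sampler(commands, interval_seconds, samples, out):
    """Write `samples` rounds of output for every command to the stream `out`."""
    for i in range(samples):
        for command in commands:
            out.write(take_sample(command))
        out.flush()
        # Sleep between rounds, but not after the last one.
        if i < samples - 1:
            time.sleep(interval_seconds)
```

A call such as `run_sampler([["vmstat", "1", "2"]], 30, 120, open("os_stats.log", "a"))` would record one hour of samples at 30-second intervals; the command, interval, and file name here are placeholders to adjust for the actual host.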