Root cause for node eviction needed
Hi,
Could you please help us find the root cause of the orinoco1 server reboot (CRS node eviction)?
The error messages are below:
Jun 27 17:51:59 orinoco1 logger: Oracle CSSD failure 134.
Jun 27 17:51:59 orinoco1 logger: Oracle CRS failure. Rebooting for cluster integrity.
Jun 27 17:52:00 orinoco1 logger: Oracle clsomon failed with fatal status 12.
Jun 27 17:52:00 orinoco1 logger: Oracle CRS failure. Rebooting for cluster integrity.
====== OCSSD LOG ==================================================================================================== ========
[ CSSD]2011-06-27 17:43:21.978 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x7ac510) proc(0x777f60) pid() proto(10:2:1:1)
[ CSSD]2011-06-27 17:43:45.328 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x7af170) proc(0x773c10) pid() proto(10:2:1:1)
[ CSSD]2011-06-27 17:44:45.678 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x7af170) proc(0x773c10) pid() proto(10:2:1:1)
[ CSSD]2011-06-27 17:45:02.940 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x7af170) proc(0x773c10) pid(11998) proto(10:2:1:1)
[ CSSD]2011-06-27 17:45:16.233 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x7a2d80) proc(0x77a900) pid(12822) proto(10:2:1:1)
[ CSSD]2011-06-27 17:45:45.970 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x77abf0) proc(0x777e60) pid() proto(10:2:1:1)
[ CSSD]2011-06-27 17:46:46.330 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x77abf0) proc(0x777e60) pid() proto(10:2:1:1)
[ CSSD]2011-06-27 17:50:21.821 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 50% heartbeat fatal, eviction in 29.560 seconds
[ CSSD]2011-06-27 17:50:22.823 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 50% heartbeat fatal, eviction in 28.560 seconds
[ CSSD]2011-06-27 17:50:36.831 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 75% heartbeat fatal, eviction in 14.550 seconds
[ CSSD]2011-06-27 17:50:37.823 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 75% heartbeat fatal, eviction in 13.560 seconds
[ CSSD]2011-06-27 17:50:45.829 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 90% heartbeat fatal, eviction in 5.560 seconds
[ CSSD]2011-06-27 17:50:46.831 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 90% heartbeat fatal, eviction in 4.560 seconds
[ CSSD]2011-06-27 17:50:47.833 [1241577824] >TRACE: clssnmPollingThread: node orinoco2 (2) is impending reconfig
[ CSSD]2011-06-27 17:50:47.833 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 90% heartbeat fatal, eviction in 3.550 seconds
[ CSSD]2011-06-27 17:50:48.825 [1241577824] >TRACE: clssnmPollingThread: node orinoco2 (2) is impending reconfig
[ CSSD]2011-06-27 17:50:48.825 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 90% heartbeat fatal, eviction in 2.560 seconds
[ CSSD]2011-06-27 17:50:49.827 [1241577824] >TRACE: clssnmPollingThread: node orinoco2 (2) is impending reconfig
[ CSSD]2011-06-27 17:50:49.827 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 90% heartbeat fatal, eviction in 1.560 seconds
[ CSSD]2011-06-27 17:50:50.829 [1241577824] >TRACE: clssnmPollingThread: node orinoco2 (2) is impending reconfig
[ CSSD]2011-06-27 17:50:50.829 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 90% heartbeat fatal, eviction in 0.560 seconds
==================================================================================================== ==========================
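As a cross-check on the CSSD excerpt, here is a minimal Python sketch that adds the "eviction in N seconds" value to each warning's timestamp to get the eviction deadline that CSSD was counting down to. The names `CSSD_WARNINGS`, `WARNING_RE` and `projected_deadlines` are made up for this illustration, and only four of the warning lines are copied in. It is a sketch of the arithmetic, not an Oracle-supplied tool.

```python
import re
from datetime import datetime, timedelta

# Four of the WARNING lines from the CSSD excerpt above, copied verbatim.
CSSD_WARNINGS = """\
[ CSSD]2011-06-27 17:50:21.821 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 50% heartbeat fatal, eviction in 29.560 seconds
[ CSSD]2011-06-27 17:50:36.831 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 75% heartbeat fatal, eviction in 14.550 seconds
[ CSSD]2011-06-27 17:50:45.829 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 90% heartbeat fatal, eviction in 5.560 seconds
[ CSSD]2011-06-27 17:50:50.829 [1241577824] >WARNING: clssnmPollingThread: node orinoco2 (2) at 90% heartbeat fatal, eviction in 0.560 seconds
"""

# Hypothetical pattern written for this excerpt's line layout only.
WARNING_RE = re.compile(
    r"\](?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"node (?P<node>\S+) \(\d+\) at (?P<pct>\d+)% heartbeat fatal, "
    r"eviction in (?P<left>\d+\.\d+) seconds"
)


def projected_deadlines(text):
    """Return (node, percent, deadline) for every heartbeat warning line."""
    results = []
    for line in text.splitlines():
        match = WARNING_RE.search(line)
        if match is None:
            continue  # not a heartbeat warning
        logged_at = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        remaining = timedelta(seconds=float(match["left"]))
        results.append((match["node"], int(match["pct"]), logged_at + remaining))
    return results


if __name__ == "__main__":
    for node, pct, deadline in projected_deadlines(CSSD_WARNINGS):
        print(f"{node} at {pct}%: eviction projected for {deadline}")
```

For these lines the projected deadline stays at roughly 17:50:51.38 to 17:50:51.39, so the countdown never reset. That suggests orinoco1 received no network heartbeat from orinoco2 at any point during the 29 seconds covered by the warnings.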
====== /var/log/messages ==================================================================================================== ========
Jun 27 17:45:01 orinoco1 su(pam_unix)[11911]: session opened for user oracle by (uid=0)
Jun 27 17:45:01 orinoco1 su(pam_unix)[11911]: session closed for user oracle
Jun 27 17:47:40 orinoco1 kernel: bnx2: eth0 NIC Link is Down
Jun 27 17:47:41 orinoco1 kernel: LLT INFO V-14-1-10205 link 2 (eth0) node 0 in trouble
Jun 27 17:47:41 orinoco1 kernel: LLT INFO V-14-1-10205 link 2 (eth0) node 2 in trouble
Jun 27 17:47:41 orinoco1 kernel: LLT INFO V-14-1-10205 link 2 (eth0) node 5 in trouble
Jun 27 17:47:41 orinoco1 kernel: LLT INFO V-14-1-10205 link 2 (eth0) node 1 in trouble
Jun 27 17:47:41 orinoco1 kernel: LLT INFO V-14-1-10205 link 2 (eth0) node 3 in trouble
Jun 27 17:47:43 orinoco1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
Jun 27 17:47:44 orinoco1 kernel: LLT INFO V-14-1-10024 link 2 (eth0) node 0 active
Jun 27 17:47:44 orinoco1 kernel: LLT INFO V-14-1-10024 link 2 (eth0) node 2 active
Jun 27 17:47:44 orinoco1 kernel: LLT INFO V-14-1-10024 link 2 (eth0) node 5 active
Jun 27 17:47:44 orinoco1 kernel: LLT INFO V-14-1-10024 link 2 (eth0) node 1 active
Jun 27 17:47:44 orinoco1 kernel: LLT INFO V-14-1-10024 link 2 (eth0) node 3 active
Jun 27 17:47:45 orinoco1 kernel: o2net: connection to node orinoco2 (num 1) at 199.40.40.234:7777 has been idle for 10.0 seconds, shutting it down.
Jun 27 17:47:45 orinoco1 kernel: (0,0):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1309168055.597322 now 1309168065.596662 dr 1309168055.597308 adv 1309168055.597329:1309168055.597330 func (d5542a8e:504) 1309168035.598570:1309168035.598693)
Jun 27 17:47:45 orinoco1 kernel: o2net: no longer connected to node orinoco2 (num 1) at 199.40.40.234:7777
==================================================================================================== ==========================
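To put numbers on the /var/log/messages excerpt, here is a minimal Python sketch that measures how long eth0 was down and how long the o2net connection had been idle according to the kernel's own timer values. The names `MESSAGES`, `link_outages` and `o2net_idle_seconds` are made up for this illustration. The year 2011 is taken from the CSSD log because syslog lines carry no year, and reading `tmr` as "idle timer armed" and `now` as "timer fired" is an assumption about the o2net debug line, not something the log states.

```python
import re
from datetime import datetime

# Lines copied verbatim from the /var/log/messages excerpt above.
MESSAGES = """\
Jun 27 17:47:40 orinoco1 kernel: bnx2: eth0 NIC Link is Down
Jun 27 17:47:43 orinoco1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
Jun 27 17:47:45 orinoco1 kernel: o2net: connection to node orinoco2 (num 1) at 199.40.40.234:7777 has been idle for 10.0 seconds, shutting it down.
Jun 27 17:47:45 orinoco1 kernel: (0,0):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1309168055.597322 now 1309168065.596662 dr 1309168055.597308 adv 1309168055.597329:1309168055.597330 func (d5542a8e:504) 1309168035.598570:1309168035.598693)
"""

# Hypothetical patterns written for this excerpt's line layout only.
LINK_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: \S+ "
    r"(?P<iface>\w+) NIC Link is (?P<state>Up|Down)"
)
TIMER_RE = re.compile(r"tmr (?P<tmr>\d+\.\d+) now (?P<now>\d+\.\d+)")


def link_outages(text, year=2011):
    """Return (interface, seconds_down) for each Down/Up pair in the log."""
    down_at = {}
    outages = []
    for line in text.splitlines():
        match = LINK_RE.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S")
        iface = match["iface"]
        if match["state"] == "Down":
            down_at[iface] = stamp
        elif iface in down_at:
            outages.append((iface, (stamp - down_at.pop(iface)).total_seconds()))
    return outages


def o2net_idle_seconds(text):
    """Return now - tmr from the o2net_idle_timer debug line, or None."""
    match = TIMER_RE.search(text)
    if match is None:
        return None
    return float(match["now"]) - float(match["tmr"])


if __name__ == "__main__":
    print("link outages:", link_outages(MESSAGES))
    print("o2net idle seconds:", o2net_idle_seconds(MESSAGES))
```

On these lines the sketch reports eth0 down for 3 seconds and an o2net idle time of just under 10 seconds. If the assumption about `tmr` holds, the OCFS2 connection to orinoco2 had already been silent for about 5 seconds when the link-down message was logged at 17:47:40.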
From the above messages we concluded that the server was rebooted to preserve cluster integrity, following the network interface failure logged in /var/log/messages.
Can somebody confirm whether this was due to:
i) a private interconnect network failure
or
ii) a voting disk issue
Please also confirm that this was not caused by the glibc bug that causes random evictions. Note that the O/S is Red Hat Enterprise Linux AS release 4 (Nahant Update 4) with kernel 2.6.9-42.ELsmp.
glibc version: glibc-2.3.4-2.25
Thanks