RAC node reboot for no apparent reason
I saw some strange RAC behaviour yesterday afternoon: one node of a two-node cluster just rebooted for no apparent reason. Nothing appears in /var/log/messages around the time the server restarted. However, "fsck" was run upon restart. Could this be a system crash caused by a hardware error? If so, how can I find out the reason?
I looked up Oracle logs (starting from alert.log) and could not make out anything except regular messages of startup sequences. O2CB is configured with a heartbeat of 60.
What is even more strange is that once the first instance detected unavailability of node2, the cluster did not shut down, but continued normally. Obviously, when node2 was back, node1 promptly did instance recovery and everything was back to normal.
Can anyone guide me on how to locate or determine the reason behind the node2 shutdown, and also why Clusterware did not shut down the cluster (read: the surviving instance)?
OS: RHES Linux 4 U4, x86_64
Oracle: 10.2.0.3, x86_64
Any further information would be gladly given (obviously).
Thanks in advance and regards,
Are you sure it didn't ASR as opposed to reboot?
Forgive my ignorance davey23uk, but what is ASR?
Automatic Server Recovery: if the machine hangs, the BIOS will detect it and automatically reboot the server after X minutes.
Check the iLO event log for that.
I have checked iLO event logs and there's no indication.
Also checked clusterware logs and there are only startup messages and previous normal messages. The time-gap between shut-down and restart has not been captured by any log as far as my knowledge goes.
I noticed that NTP had somehow screwed up in node2 and it was lagging node1 by 1 hour!
Do you think that that might have been the cause of node eviction?
Here's an excerpt from node1 ocssd.log:
[ CSSD]2008-06-23 13:37:04.609  >WARNING: clssnmPollingThread: node evg60lx-stnet-oracle-rac2 (2) at 50% heartbeat fatal, eviction in 29.880 seconds
[ CSSD]2008-06-23 13:37:05.611  >WARNING: clssnmPollingThread: node evg60lx-stnet-oracle-rac2 (2) at 50% heartbeat fatal, eviction in 28.880 seconds
[ CSSD]2008-06-23 13:37:34.499  >TRACE: clssnmPollingThread: node evg60lx-stnet-oracle-rac2 (2) is impending reconfig
[ CSSD]2008-06-23 13:37:34.499  >TRACE: clssnmPollingThread: Eviction started for node evg60lx-stnet-oracle-rac2 (2), flags 0x000d, state 3, wt4c 0
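For anyone comparing the countdown in these warnings against the moment eviction actually began, here is a minimal Python sketch. It assumes the line layout seen in the four lines quoted above; the regular expressions and the function names (`parse_cssd`, `predicted_eviction`, `eviction_started`) are made up for illustration and are not part of any Oracle tool.

```python
import re
from datetime import datetime, timedelta

# Layout inferred from the ocssd.log lines quoted above (an assumption;
# other Clusterware versions may format these lines differently).
CSSD_LINE = re.compile(
    r"\[\s*CSSD\](?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"
    r"\s*>(?P<level>\w+):\s*(?P<msg>.*)"
)
EVICTION_IN = re.compile(r"eviction in (?P<secs>\d+\.\d+) seconds")


def parse_cssd(lines):
    """Return (timestamp, level, message) for every line that matches."""
    events = []
    for line in lines:
        m = CSSD_LINE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, m.group("level"), m.group("msg")))
    return events


def predicted_eviction(events):
    """First countdown warning's timestamp plus its remaining seconds."""
    for ts, _level, msg in events:
        m = EVICTION_IN.search(msg)
        if m:
            return ts + timedelta(seconds=float(m.group("secs")))
    return None


def eviction_started(events):
    """Timestamp of the first 'Eviction started' trace line, if any."""
    for ts, _level, msg in events:
        if "Eviction started" in msg:
            return ts
    return None


sample = [
    "[ CSSD]2008-06-23 13:37:04.609  >WARNING: clssnmPollingThread: node evg60lx-stnet-oracle-rac2 (2) at 50% heartbeat fatal, eviction in 29.880 seconds",
    "[ CSSD]2008-06-23 13:37:05.611  >WARNING: clssnmPollingThread: node evg60lx-stnet-oracle-rac2 (2) at 50% heartbeat fatal, eviction in 28.880 seconds",
    "[ CSSD]2008-06-23 13:37:34.499  >TRACE: clssnmPollingThread: node evg60lx-stnet-oracle-rac2 (2) is impending reconfig",
    "[ CSSD]2008-06-23 13:37:34.499  >TRACE: clssnmPollingThread: Eviction started for node evg60lx-stnet-oracle-rac2 (2), flags 0x000d, state 3, wt4c 0",
]

events = parse_cssd(sample)
predicted = predicted_eviction(events)
started = eviction_started(events)
print("predicted:", predicted)
print("started:  ", started)
print("gap (s):  ", (started - predicted).total_seconds())
```

On the excerpt above, the predicted and logged eviction times land within a few hundredths of a second of each other, which suggests node1 saw no heartbeat from node2 for the whole countdown.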
Node2 reboot time shows 12:38:45. So, would the out-of-sync time on the two nodes be the reason for heartbeat failure?
Any suggestions/hints would be highly appreciated.
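To line up the two clocks before comparing events, a rough Python sketch follows. It assumes node2 lagged node1 by exactly one hour (the post only says "by 1 hour", so the real offset may differ by seconds or minutes) and that the 12:38:45 reboot time is on the same date as the ocssd.log entries. The helper name `to_reference_clock` is hypothetical.

```python
from datetime import datetime, timedelta


def to_reference_clock(ts, lag):
    """Shift a timestamp taken on a lagging clock onto the reference clock."""
    return ts + lag


# Assumption: node2's clock was exactly one hour behind node1's.
node2_lag = timedelta(hours=1)

# From node1's ocssd.log: "Eviction started for node ... rac2".
eviction_started_node1 = datetime(2008, 6, 23, 13, 37, 34, 499000)

# Reboot time as reported on node2 (date assumed to be the same day).
node2_reboot_local = datetime(2008, 6, 23, 12, 38, 45)

node2_reboot_on_node1_clock = to_reference_clock(node2_reboot_local, node2_lag)
seconds_after_eviction = (
    node2_reboot_on_node1_clock - eviction_started_node1
).total_seconds()

print("node2 reboot on node1's clock:", node2_reboot_on_node1_clock)
print("seconds after eviction started:", seconds_after_eviction)
```

Under those assumptions the reboot falls a little over a minute after node1 started the eviction, so the ordering is at least consistent with node2 having been evicted rather than having rebooted first.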
Looks like the servers lost interconnect communication with each other.
I found the following entries in /var/log/messages:
Jun 23 13:34:29 evg60lx-stnet-oracle-rac1 su(pam_unix): session opened for user oracle by (uid=0)
Jun 23 13:34:29 evg60lx-stnet-oracle-rac1 su(pam_unix): session closed for user oracle
Jun 23 13:34:30 evg60lx-stnet-oracle-rac1 kernel: o2net: connection to node evg60lx-stnet-oracle-rac2 (num 1) at 172.18.0.252:7777 has been idle for 10.0 seconds, shutting it down.
Jun 23 13:34:30 evg60lx-stnet-oracle-rac1 kernel: (0,0):o2net_idle_timer:1418 here are some times that might help debug the situation: (tmr 1214246060.533338 now 1214246070.533571 dr 1214246060.533329 adv 1214246060.533339:1214246060.533339 func (bc55c001:504) 1214238801.75265:1214238801.75271)
Jun 23 13:34:30 evg60lx-stnet-oracle-rac1 kernel: o2net: no longer connected to node evg60lx-stnet-oracle-rac2 (num 1) at 172.18.0.252:7777
Jun 23 13:34:30 evg60lx-stnet-oracle-rac1 su(pam_unix): session opened for user oracle by (uid=0)
Jun 23 13:34:30 evg60lx-stnet-oracle-rac2 su(pam_unix): session closed for user oracle
Jun 23 13:34:30 evg60lx-stnet-oracle-rac2 kernel: o2net: no longer connected to node evg60lx-stnet-oracle-rac1 (num 0) at 172.18.0.250:7777
Jun 23 13:34:31 evg60lx-stnet-oracle-rac2 su(pam_unix): session opened for user oracle by (uid=0)
I am not sure what this means (especially o2net). I am looking into it now.
Could you please shed more light on the above messages?
Thanks again for the kind help.
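The numbers in the idle-timer line look like Unix epoch timestamps (seconds with a fractional part). A small Python sketch to decode them is below; reading "tmr" as the time the idle timer was armed and "now" as the time it fired is my interpretation, not something the log states, and `idle_timer_fields` is a made-up helper name.

```python
import re
from datetime import datetime, timezone
from decimal import Decimal

# The relevant part of the kernel message quoted above.
LINE = (
    "here are some times that might help debug the situation: "
    "(tmr 1214246060.533338 now 1214246070.533571 dr 1214246060.533329 "
    "adv 1214246060.533339:1214246060.533339 func (bc55c001:504) "
    "1214238801.75265:1214238801.75271)"
)


def idle_timer_fields(line):
    """Extract the 'tmr' and 'now' values as exact decimals."""
    m = re.search(r"tmr (\d+\.\d+) now (\d+\.\d+)", line)
    if m is None:
        raise ValueError("no tmr/now fields found")
    return Decimal(m.group(1)), Decimal(m.group(2))


tmr, now = idle_timer_fields(LINE)

# Decimal avoids floating-point noise in the microsecond digits.
idle_seconds = now - tmr

# Whole-second part of 'now' as a UTC wall-clock time.
fired_utc = datetime.fromtimestamp(int(now), tz=timezone.utc)

print("idle for", idle_seconds, "seconds")
print("timer fired at", fired_utc.isoformat())
```

The difference between the two values agrees with the "has been idle for 10.0 seconds" message on the preceding line, so node1 apparently received nothing from node2 over the interconnect address 172.18.0.252 for that interval.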
You have a problem with your Clusterware configuration.
Try booting the server with a different kernel. Note that /etc/init.d/init.cssd is what does the reboot for you.