RAC node reboot for no apparent reason
DBAsupport.com Forums - Powered by vBulletin

  1. #1
    Join Date
    Jan 2004
    Location
    Bangalore, India
    Posts
    66

    RAC node reboot for no apparent reason

    Hello all,

    I saw strange RAC behaviour yesterday afternoon: one node of a two-node cluster simply rebooted for no apparent reason. Nothing appears in /var/log/messages around the time the server restarted. However, "fsck" was run upon restart. Could this be a system crash due to a hardware error? If so, how can I find out the reason?
    I looked through the Oracle logs (starting from alert.log) and could not make out anything except the regular startup messages. O2CB is configured with a heartbeat of 60.
    What is even more strange is that once the first instance detected unavailability of node2, the cluster did not shut down, but continued normally. Obviously, when node2 was back, node1 promptly did instance recovery and everything was back to normal.

    Can anyone guide me on how to determine the reason behind the node2 shutdown, and also why Clusterware did not shut down the cluster (read: the surviving instance)?

    Hardware: HP
    OS: RHES Linux 4 U4, x86_64
    Oracle: 10.2.0.3, x86_64

    Any further information would be gladly given (obviously).

    Thanks in advance and regards,
    Suhas

  2. #2
    Join Date
    Sep 2002
    Location
    England
    Posts
    7,331
    Are you sure it didn't ASR as opposed to reboot?

  3. #3
    Join Date
    Jan 2004
    Location
    Bangalore, India
    Posts
    66
    Forgive my ignorance, davey23uk, but what is ASR?

    Regards,
    Suhas

  4. #4
    Join Date
    Sep 2002
    Location
    England
    Posts
    7,331
    Automatic system reboot (HP calls it Automatic Server Recovery): if the machine hangs, the BIOS will detect it and automatically reboot the server after X minutes.

    Check the iLO event log for that.

  5. #5
    Join Date
    Jun 2000
    Location
    Madrid, Spain
    Posts
    7,448
    Check the Clusterware logs.

  6. #6
    Join Date
    Jan 2004
    Location
    Bangalore, India
    Posts
    66
    I have checked iLO event logs and there's no indication.

    I also checked the Clusterware logs, and there are only startup messages and the earlier normal messages. As far as I can tell, no log captured anything in the gap between shutdown and restart.

    Thanks!
    Suhas

  7. #7
    Join Date
    Jan 2004
    Location
    Bangalore, India
    Posts
    66
    Hey everyone,

    I noticed that NTP had somehow screwed up on node2, and its clock was lagging node1 by 1 hour!
    Do you think that might have been the cause of the node eviction?
    Here's an excerpt from the node1 ocssd.log:
    [ CSSD]2008-06-23 13:37:04.609 [1241577824] >WARNING: clssnmPollingThread: node evg60lx-stnet-oracle-rac2 (2) at 50% heartbeat fatal, eviction in 29.880 seconds
    [ CSSD]2008-06-23 13:37:05.611 [1241577824] >WARNING: clssnmPollingThread: node evg60lx-stnet-oracle-rac2 (2) at 50% heartbeat fatal, eviction in 28.880 seconds
    .
    .
    .
    [ CSSD]2008-06-23 13:37:34.499 [1241577824] >TRACE: clssnmPollingThread: node evg60lx-stnet-oracle-rac2 (2) is impending reconfig
    [ CSSD]2008-06-23 13:37:34.499 [1241577824] >TRACE: clssnmPollingThread: Eviction started for node evg60lx-stnet-oracle-rac2 (2), flags 0x000d, state 3, wt4c 0
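The countdown in the excerpt above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming the "50%" in the warning means that half of the heartbeat timeout has already elapsed; that reading is inferred from the message text, and the function name `projected_eviction` is hypothetical.

```python
import re
from datetime import datetime, timedelta

# First warning line, copied from the node1 ocssd.log excerpt.
WARNING = ("[ CSSD]2008-06-23 13:37:04.609 [1241577824] >WARNING: "
           "clssnmPollingThread: node evg60lx-stnet-oracle-rac2 (2) "
           "at 50% heartbeat fatal, eviction in 29.880 seconds")

WARNING_RE = re.compile(
    r"\](?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"at (?P<pct>\d+)% heartbeat fatal, eviction in (?P<left>[\d.]+) seconds"
)

def projected_eviction(line):
    """Return (projected eviction time, implied total timeout in seconds).

    Returns None if the line is not a 'heartbeat fatal' warning.
    """
    m = WARNING_RE.search(line)
    if m is None:
        return None
    logged_at = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    left = float(m.group("left"))
    pct = int(m.group("pct"))
    if pct >= 100:
        return logged_at, None
    # Assumption: pct% of the timeout has elapsed and `left` seconds remain,
    # so the full timeout is left / (1 - pct/100).
    total = left / (1 - pct / 100.0)
    return logged_at + timedelta(seconds=left), total

eviction_at, implied_timeout = projected_eviction(WARNING)
print(eviction_at, implied_timeout)
```

Under that assumption, the projected eviction time lands within about 10 ms of the logged "Eviction started" entry at 13:37:34.499, and the implied timeout is close to 60 seconds. Working backwards, node1 would have stopped seeing node2's heartbeat at roughly 13:36:35 by node1's clock.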

    Node2 reboot time shows 12:38:45. So, would the out-of-sync time on the two nodes be the reason for heartbeat failure?
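To compare the two timestamps, node2's reboot time first has to be put on node1's clock. A minimal sketch of that conversion follows, assuming the skew was exactly 1 hour and that 12:38:45 was read from node2's own lagging clock; neither is confirmed in the thread.

```python
from datetime import datetime, timedelta

# Assumption: node2's clock lagged node1 by exactly one hour.
SKEW = timedelta(hours=1)

# Times taken from the post and the ocssd.log excerpt.
node2_reboot_local = datetime(2008, 6, 23, 12, 38, 45)
node1_eviction_start = datetime(2008, 6, 23, 13, 37, 34, 499000)

def to_node1_clock(node2_time, skew=SKEW):
    """Convert a node2 wall-clock time to node1's clock."""
    return node2_time + skew

reboot_on_node1_clock = to_node1_clock(node2_reboot_local)
gap = reboot_on_node1_clock - node1_eviction_start
print(reboot_on_node1_clock, gap.total_seconds())
```

With these inputs the reboot falls about 70.5 seconds after node1 logged the start of the eviction, so the two events line up once the skew is removed. That ordering fits a reboot that followed the eviction, but it does not by itself show that the clock skew caused the heartbeat failure.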

    Any suggestions/hints would be highly appreciated.

    Thanks again!
    Suhas

  8. #8
    Join Date
    Jun 2000
    Location
    Madrid, Spain
    Posts
    7,448
    Looks like the servers lost interconnect communication between them.

  9. #9
    Join Date
    Jan 2004
    Location
    Bangalore, India
    Posts
    66
    Hello,

    I found the following entries in /var/log/messages:

    Node1:

    Jun 23 13:34:29 evg60lx-stnet-oracle-rac1 su(pam_unix)[11130]: session opened for user oracle by (uid=0)
    Jun 23 13:34:29 evg60lx-stnet-oracle-rac1 su(pam_unix)[11130]: session closed for user oracle
    Jun 23 13:34:30 evg60lx-stnet-oracle-rac1 kernel: o2net: connection to node evg60lx-stnet-oracle-rac2 (num 1) at 172.18.0.252:7777 has been idle for 10.0 seconds, shutting it down.
    Jun 23 13:34:30 evg60lx-stnet-oracle-rac1 kernel: (0,0):o2net_idle_timer:1418 here are some times that might help debug the situation: (tmr 1214246060.533338 now 1214246070.533571 dr 1214246060.533329 adv 1214246060.533339:1214246060.533339 func (bc55c001:504) 1214238801.75265:1214238801.75271)
    Jun 23 13:34:30 evg60lx-stnet-oracle-rac1 kernel: o2net: no longer connected to node evg60lx-stnet-oracle-rac2 (num 1) at 172.18.0.252:7777
    Jun 23 13:34:30 evg60lx-stnet-oracle-rac1 su(pam_unix)[11179]: session opened for user oracle by (uid=0)

    Node2:

    Jun 23 13:34:30 evg60lx-stnet-oracle-rac2 su(pam_unix)[5422]: session closed for user oracle
    Jun 23 13:34:30 evg60lx-stnet-oracle-rac2 kernel: o2net: no longer connected to node evg60lx-stnet-oracle-rac1 (num 0) at 172.18.0.250:7777
    Jun 23 13:34:31 evg60lx-stnet-oracle-rac2 su(pam_unix)[5471]: session opened for user oracle by (uid=0)

    I am not sure what this means (especially o2net). I am looking into it now.
    Could you please shed more light on the above messages?
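The numbers in the o2net debug line on node1 can be decoded directly. The sketch below assumes that each value is "seconds.microseconds" since the Unix epoch, that `tmr` is when the idle timer was last armed, and that `now` is when it fired; these meanings are inferred from the field names, and the helper names are hypothetical.

```python
import re

# The timestamp fields from the node1 kernel message.
FIELDS = ("tmr 1214246060.533338 now 1214246070.533571 "
          "dr 1214246060.533329 adv 1214246060.533339:1214246060.533339")

def parse_stamp(text, label):
    """Return the timestamp after `label` as integer microseconds."""
    m = re.search(r"\b" + re.escape(label) + r" (\d+)\.(\d+)", text)
    if m is None:
        raise ValueError("field %r not found" % label)
    # Assumption: the part after the dot is a microsecond count, not a
    # decimal fraction, so the two halves are parsed as separate integers.
    sec, usec = int(m.group(1)), int(m.group(2))
    return sec * 1_000_000 + usec

def idle_seconds(text):
    """Seconds between arming the idle timer (tmr) and it firing (now)."""
    return (parse_stamp(text, "now") - parse_stamp(text, "tmr")) / 1_000_000

print(idle_seconds(FIELDS))
```

The difference comes out at just over 10 seconds, which agrees with the "has been idle for 10.0 seconds" text in the preceding kernel line: node1 heard nothing from node2 on the OCFS2 interconnect (172.18.0.252:7777) for that long and closed the connection.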

    Thanks again for the kind help.

    Suhas

  10. #10
    Join Date
    Mar 2001
    Location
    Ireland/Dublin
    Posts
    684
    You have a problem with your Clusterware configuration.
    Try booting the server with a different kernel. It is /etc/init.d/init.cssd that does the reboot for you.
    Best wishes!
    Dmitri
