DBAsupport.com Forums - Powered by vBulletin

Thread: rac instance died on linux cluster

  1. #1

    rac instance died on linux cluster

    Hi,
    I have a RAC 9.2.0.2 database running on Red Hat AS 2.1. It has been up and running for about 4 months (I patched it to 9.2.0.2 one month ago).
    Last night one instance died unexpectedly while the other instance kept running. Not much business was affected, but I want to know why it died and I am unable to find out, so I am looking for your help.
    Here is some information:
    I tested the interconnect and the service network card, and both were running fine (at 7:00 am). The disk system is also OK.

    Alert log file:
    Fri Dec 20 23:38:24 2002
    Thread 2 advanced to log sequence 265
    Current log# 3 seq# 265 mem# 0: /dev/raw/raw8
    Sat Dec 21 04:04:29 2002
    Errors in file /home/oracle/admin/rac/bdump/rac2_lmon_1634.trc:
    ORA-29740: evicted by member 0, group incarnation 7
    Sat Dec 21 04:04:29 2002
    LMON: terminating instance due to error 29740
    Sat Dec 21 04:04:31 2002
    Trace dumping is performing id=[cdmp_20021221040431]
    Sat Dec 21 04:04:34 2002
    Instance terminated by LMON, pid = 1634
    Sat Dec 21 07:38:36 2002
    Starting ORACLE instance (normal)
    Sat Dec 21 07:38:36 2002
    And this is from the trace file:


    [oracle@rac2 bdump]$ cat /home/oracle/admin/rac/bdump/rac2_lmon_1634.trc
    /home/oracle/admin/rac/bdump/rac2_lmon_1634.trc
    Oracle9i Enterprise Edition Release 9.2.0.2.0 - Production
    With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
    JServer Release 9.2.0.2.0 - Production
    ORACLE_HOME = /home/oracle/9.2.0
    System name: Linux
    Node name: rac2
    Release: 2.4.9-e.3smp
    Version: #1 SMP Fri May 3 16:48:54 EDT 2002
    Machine: i686
    Instance name: rac2
    Redo thread mounted by this instance: 0
    Oracle process number: 4
    Unix process pid: 1634, image: oracle@rac2 (LMON)

    *** SESSION ID:(3.1) 2002-11-22 03:29:38.649
    Batch msg size = 2048
    Batching factor: enqueue replay 48, ack 53
    Batching factor: cache replay 34 size per lock 56
    kjxggin: receive buffer size = 32768
    kjxgmin: SKGXN ver (2 1 Oracle 9i Reference CM)
    CMCLI WARNING: CMInitContext: init ctx(0xacc37e8)
    *** 2002-11-22 03:29:42.243
    kjxgmrcfg: Reconfiguration started, reason 1
    kjxgmcs: Setting state to 0 0.
    *** 2002-11-22 03:29:42.243
    Name Service frozen
    kjxgmcs: Setting state to 0 1.
    kjfcpiora: publish my weight 152022
    kjxgmps: proposing substate 2
    kjxgmcs: Setting state to 6 2.
    Performed the unique instance identification check
    kjxgmps: proposing substate 3
    kjxgmcs: Setting state to 6 3.
    Name Service recovery started
    Deleted all dead-instance name entries
    kjxgmps: proposing substate 4
    kjxgmcs: Setting state to 6 4.
    Multicasted all local name entries for publish
    Replayed all pending requests
    kjxgmps: proposing substate 5
    kjxgmcs: Setting state to 6 5.
    Name Service normal
    Name Service recovery done
    *** 2002-11-22 03:29:43.397
    kjxgmps: proposing substate 6
    kjxgmcs: Setting state to 6 6.
    *** 2002-11-22 03:29:43.507
    *** 2002-11-22 03:29:43.508
    Reconfiguration started
    Synchronization timeout interval: 660 sec
    List of nodes: 0,1,
    Global Resource Directory frozen
    node 0
    release 9 2 0 2
    node 1
    release 9 2 0 2
    res_master_weight for node 0 is 152022
    res_master_weight for node 1 is 152022
    Total master weight = 304044
    Dead inst
    Join inst 0 1
    Exist inst
    Active Sendback Threshold = 50 %
    Communication channels reestablished
    Master broadcasted resource hash value bitmaps
    Non-local Process blocks cleaned out
    Resources and enqueues cleaned out
    Resources remastered 0
    0 GCS shadows traversed, 0 cancelled, 0 closed
    0 GCS resources traversed, 0 cancelled
    set master node info
    Submitted all remote-enqueue requests
    kjfcrfg: Number of mesgs sent to node 0 = 0
    Update rdomain variables
    Dwn-cvts replayed, VALBLKs dubious
    All grantable enqueues granted
    *** 2002-11-22 03:29:43.868
    0 GCS shadows traversed, 0 replayed, 0 unopened
    Submitted all GCS cache requests
    0 write requests issued in 887 GCS resources
    0 PIs marked suspect, 0 flush PI msgs
    *** 2002-11-22 03:29:44.116
    Reconfiguration complete
    *** 2002-11-22 03:29:51.261
    kjxgrtmc2: Member 1 thread 2 mounted
    *** 2002-12-21 04:02:05.645
    kjxgrgetresults: Detect reconfig from 0, seq 6, reason 2
    *** 2002-12-21 04:01:57.014
    kjxgrrcfgchk: Initiating reconfig, reason 2
    *** 2002-12-21 04:01:57.014
    kjxgmrcfg: Reconfiguration started, reason 2
    kjxgmcs: Setting state to 6 0.
    *** 2002-12-21 04:01:57.021
    Name Service frozen
    kjxgmcs: Setting state to 6 1.
    *** 2002-12-21 04:04:29.911
    kjxgrdtrt: Evicted by 0, seq (7, 6)
    error 29740 detected in background process
    ORA-29740: evicted by member 0, group incarnation 7
    ksuitm: waiting for [5] seconds before killing DIAG
    www.cnoug.org

  2. #2
    Join Date
    Dec 2002
    Posts
    9
    chao_ping,

    ORA-29740 is raised when a member is evicted from the cluster group by another member.
    Towards the bottom of the trace file you will see the reason code, reason 2:

    *** 2002-12-21 04:02:05.645
    kjxgrgetresults: Detect reconfig from 0, seq 6, reason 2
    *** 2002-12-21 04:01:57.014
    kjxgrrcfgchk: Initiating reconfig, reason 2
    *** 2002-12-21 04:01:57.014
    kjxgmrcfg: Reconfiguration started, reason 2
    kjxgmcs: Setting state to 6 0.
    *** 2002-12-21 04:01:57.021

    Reason 2 means that an instance death was detected. In other words, this instance failed to issue a heartbeat to the controlfile (CF). When LMON detects that an instance is not issuing a heartbeat, LMON will ping it. If there is a response, LMON will consider the instance alive,
    but if the heartbeat is not issued within the threshold of the controlfile_enqueue_timeout, usually set internally to 900 sec (15 min), then the instance will be considered a problem and will be evicted. This is apparently what happened in your case.

    Look at the instance alert log. You could disable the eviction by setting a parameter to false; however, this is not recommended, since the eviction is a symptom pointing to another problem that needs addressing.

    Check all the trace files and dumps and determine what caused the instance to fail and be evicted, and you'll have your reason.
    ___________________________________________________________
    ORA-1578
    A nipple is the only intuitive interface, everything else needs a manual



  3. #3
    Hi, friend:
    At the same time the next day, the other instance in the cluster died for the same reason: ORA-29740, again with reason 2.
    The cluster had been running quite stably for the past month (the patchset was installed only about 30 days ago).
    When I checked the Linux /var/log/messages, I found that syslogd restarted on both nodes at exactly the time the RAC instance died, on both days. Could there be some relation between the two? Linux did not reboot; I checked the uptime value.
    From the trace files, it looks like the dead instance failed to send its heartbeat.
    First day, from the surviving instance rac1:
    *** 2002-12-21 04:01:54.227
    kjxgrnbrisalive: (1, 2) not beating, HB: 479418910, 479418910
    *** 2002-12-21 04:01:54.239
    kjxgrnbrdead: Detected death of 1, initiating reconfig
    kjxgrrcfgchk: Initiating reconfig, reason 2
    *** 2002-12-21 04:01:59.256
    kjxgmrcfg: Reconfiguration started, reason 2
    kjxgmcs: Setting state to 6 0.
    *** 2002-12-21 04:01:59.258
    Name Service frozen
    kjxgmcs: Setting state to 6 1.
    Second day, from the trace file of the surviving instance rac2:
    *** 2002-12-22 04:01:56.457
    kjxgrnbrisalive: (0, 1) not beating, HB: 479438832, 479438832
    *** 2002-12-22 04:01:56.457
    kjxgrnbrdead: Detected death of 0, initiating reconfig
    kjxgrrcfgchk: Initiating reconfig, reason 2
    *** 2002-12-22 04:02:01.486
    kjxgmrcfg: Reconfiguration started, reason 2
    kjxgmcs: Setting state to 9 0.
    *** 2002-12-22 04:02:01.495
    Name Service frozen
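    The "not beating" lines in the two excerpts above can be pulled out of a trace mechanically. Below is a minimal Python sketch. The regex, the function name stalled_heartbeats, and the reading of the two HB numbers as two samples of the same heartbeat counter are my own assumptions, inferred only from the two lines quoted here, not documented Oracle behavior.

```python
import re

# Assumed layout, inferred only from the two trace lines quoted in this post:
#   kjxgrnbrisalive: (<member>, <n>) not beating, HB: <value>, <value>
NOT_BEATING = re.compile(
    r"kjxgrnbrisalive: \((\d+), (\d+)\) not beating, HB: (\d+), (\d+)"
)

def stalled_heartbeats(lines):
    """Yield (member, hb_a, hb_b, unchanged) for each 'not beating' line."""
    for line in lines:
        m = NOT_BEATING.search(line)
        if m:
            member, _, hb_a, hb_b = (int(g) for g in m.groups())
            # unchanged is True when both reported HB values are identical
            yield member, hb_a, hb_b, hb_a == hb_b

# The two lines from the rac1 and rac2 LMON traces above
sample = [
    "kjxgrnbrisalive: (1, 2) not beating, HB: 479418910, 479418910",
    "kjxgrnbrisalive: (0, 1) not beating, HB: 479438832, 479438832",
]

found = list(stalled_heartbeats(sample))
for member, hb_a, hb_b, unchanged in found:
    print(f"member {member}: HB {hb_a} / {hb_b}, unchanged={unchanged}")
```

    In both excerpts the two HB values are identical, which would fit a counter that stopped advancing, if the assumption about what the two numbers mean is right.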
    www.cnoug.org

  4. #4
    Hi, friends:
    I wonder if anyone here has experience dealing with RAC systems. What should I check to find out why the RAC instance failed to update the controlfile? I have already enabled this event:
    event="29740 trace name errorstack level 3"
    in one instance.
    Should I set the undocumented parameter
    _imr_active=false in the system?
    www.cnoug.org

  5. #5
    I think I have found the answer.
    There is a cron job, /etc/cron.daily/rdate, which syncs the system time with the time server.
    It runs at exactly the time that my RAC instances died.
    I do not know whether the time on the dead instance was set backward or not, because I do not have the old value from before the time was synced with the time server.
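    One way to look for signs of a backward clock step after the fact is to scan a trace file for "***" timestamp lines that are earlier than the one before them. Here is a rough Python sketch. The function name backward_jumps and the regex are illustrative assumptions, and the sample lines are copied from the rac2 LMON trace in the first post.

```python
import re
from datetime import datetime

# Matches trace timestamp lines such as "*** 2002-12-21 04:02:05.645",
# optionally preceded by a "SESSION ID:(3.1) " prefix.
STAMP = re.compile(
    r"^\s*\*\*\* (?:SESSION ID:\(\S+\) )?"
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s*$"
)

def backward_jumps(lines):
    """Return (previous, current, seconds_back) for every timestamp
    that is earlier than the timestamp line before it."""
    jumps = []
    prev = None
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue
        cur = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None and cur < prev:
            jumps.append((prev, cur, (prev - cur).total_seconds()))
        prev = cur
    return jumps

# Lines copied from rac2_lmon_1634.trc as posted in this thread
sample = """\
*** 2002-11-22 03:29:51.261
kjxgrtmc2: Member 1 thread 2 mounted
*** 2002-12-21 04:02:05.645
kjxgrgetresults: Detect reconfig from 0, seq 6, reason 2
*** 2002-12-21 04:01:57.014
kjxgrrcfgchk: Initiating reconfig, reason 2
""".splitlines()

jumps = backward_jumps(sample)
for prev, cur, secs in jumps:
    print(f"{prev} -> {cur}: timestamp went back {secs:.3f} s")
```

    On the trace from the first post this flags the 04:02:05.645 line followed by the 04:01:57.014 line. An out-of-order timestamp might also come from how the trace was written, so I would treat it as a hint and not as proof that rdate stepped the clock back.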
    www.cnoug.org
