rac instance died on linux cluster

**chao_ping** · 12-21-2002, 08:29 AM

Hi,
I have a RAC 9.2.0.2 database running on Red Hat AS 2.1. It has been up and running for about 4 months (I patched it to 9.2.0.2 one month ago).
Last night one instance died unexpectedly while the other instance kept running. Not much business was affected, but I want to know why it died. I have been unable to find out, so I am looking for your help.
Here is some information:
I tested the interconnect and the service network card; both were running fine (at 7:00 AM), and the disk system is also OK.

Alert log file:

Fri Dec 20 23:38:24 2002
Thread 2 advanced to log sequence 265
Current log# 3 seq# 265 mem# 0: /dev/raw/raw8
Sat Dec 21 04:04:29 2002
Errors in file /home/oracle/admin/rac/bdump/rac2_lmon_1634.trc:
ORA-29740: evicted by member 0, group incarnation 7
Sat Dec 21 04:04:29 2002
LMON: terminating instance due to error 29740
Sat Dec 21 04:04:31 2002
Trace dumping is performing id=[cdmp_20021221040431]
Sat Dec 21 04:04:34 2002
Instance terminated by LMON, pid = 1634
Sat Dec 21 07:38:36 2002
Starting ORACLE instance (normal)
Sat Dec 21 07:38:36 2002

and this is from the trace file:

[oracle@rac2 bdump]$ cat /home/oracle/admin/rac/bdump/rac2_lmon_1634.trc
/home/oracle/admin/rac/bdump/rac2_lmon_1634.trc
Oracle9i Enterprise Edition Release 9.2.0.2.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.2.0 - Production
ORACLE_HOME = /home/oracle/9.2.0
System name: Linux
Node name: rac2
Release: 2.4.9-e.3smp
Version: #1 SMP Fri May 3 16:48:54 EDT 2002
Machine: i686
Instance name: rac2
Redo thread mounted by this instance: 0
Oracle process number: 4
Unix process pid: 1634, image: oracle@rac2 (LMON)

*** SESSION ID:(3.1) 2002-11-22 03:29:38.649
Batch msg size = 2048
Batching factor: enqueue replay 48, ack 53
Batching factor: cache replay 34 size per lock 56
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 1 Oracle 9i Reference CM)
CMCLI WARNING: CMInitContext: init ctx(0xacc37e8)
*** 2002-11-22 03:29:42.243
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.
*** 2002-11-22 03:29:42.243
Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 152022
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 6 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 6 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 6 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 6 5.
Name Service normal
Name Service recovery done
*** 2002-11-22 03:29:43.397
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 6 6.
*** 2002-11-22 03:29:43.507
*** 2002-11-22 03:29:43.508
Reconfiguration started
Synchronization timeout interval: 660 sec
List of nodes: 0,1,
Global Resource Directory frozen
node 0
release 9 2 0 2
node 1
release 9 2 0 2
res_master_weight for node 0 is 152022
res_master_weight for node 1 is 152022
Total master weight = 304044
Dead inst
Join inst 0 1
Exist inst
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 0
0 GCS shadows traversed, 0 cancelled, 0 closed
0 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 0
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2002-11-22 03:29:43.868
0 GCS shadows traversed, 0 replayed, 0 unopened
Submitted all GCS cache requests
0 write requests issued in 887 GCS resources
0 PIs marked suspect, 0 flush PI msgs
*** 2002-11-22 03:29:44.116
Reconfiguration complete
*** 2002-11-22 03:29:51.261
kjxgrtmc2: Member 1 thread 2 mounted
*** 2002-12-21 04:02:05.645
kjxgrgetresults: Detect reconfig from 0, seq 6, reason 2
*** 2002-12-21 04:01:57.014
kjxgrrcfgchk: Initiating reconfig, reason 2
*** 2002-12-21 04:01:57.014
kjxgmrcfg: Reconfiguration started, reason 2
kjxgmcs: Setting state to 6 0.
*** 2002-12-21 04:01:57.021
Name Service frozen
kjxgmcs: Setting state to 6 1.
*** 2002-12-21 04:04:29.911
kjxgrdtrt: Evicted by 0, seq (7, 6)
error 29740 detected in background process
ORA-29740: evicted by member 0, group incarnation 7
ksuitm: waiting for [5] seconds before killing DIAG

**ORA-1578** · 12-21-2002, 05:37 PM

chao_ping,

ORA-29740 is raised when a member is evicted from the cluster group by another member.
Toward the bottom of the trace file you will see the reason given as reason 2:

*** 2002-12-21 04:02:05.645
kjxgrgetresults: Detect reconfig from 0, seq 6, reason 2
*** 2002-12-21 04:01:57.014
kjxgrrcfgchk: Initiating reconfig, reason 2
*** 2002-12-21 04:01:57.014
kjxgmrcfg: Reconfiguration started, reason 2
kjxgmcs: Setting state to 6 0.
*** 2002-12-21 04:01:57.021

Reason 2 means that an instance death was detected. In other words, this instance failed to issue a heartbeat to the CF (control file). When LMON detects that an instance is not issuing a heartbeat, LMON will ping it. If there is a response, LMON considers the instance alive,
but if the heartbeat is not issued within the threshold of the controlfile_enqueue_timeout, usually set internally to 900 sec (15 min), the instance is considered a problem and is evicted. This is apparently what happened in your case.

Look at the instance alert log. You could disable the eviction by setting a parameter to false; however, this is not recommended, since the eviction is a symptom pointing to another problem that needs addressing.

Check all the trace files and dumps to determine what caused the instance to fail and be evicted, and you'll have your reason.
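
To make that check less tedious, here is a minimal sketch (Python 3) that pulls the reconfiguration reasons and the eviction message out of an LMON trace. It only assumes the line shapes visible in the trace pasted above (`*** <timestamp>` lines, `kjxg...: ... reason N` lines, and the `ORA-29740` line); the function name `scan_lmon_trace` is made up for illustration and is not an Oracle tool.

```python
import re

# Patterns are based only on the trace lines quoted in this thread.
TIMESTAMP = re.compile(r"^\*\*\* (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")
REASON = re.compile(r"^(kjxg\w+): .*reason (\d+)")
EVICTED = re.compile(r"^ORA-29740: evicted by member (\d+), group incarnation (\d+)")


def scan_lmon_trace(lines):
    """Return (timestamp, source, detail) tuples for reconfig and eviction lines.

    The timestamp is the most recent '*** <date> <time>' line seen before
    the event, or None if no timestamp line has appeared yet.
    """
    events = []
    current_ts = None
    for raw in lines:
        line = raw.strip()
        m = TIMESTAMP.match(line)
        if m:
            current_ts = m.group(1)
            continue
        m = REASON.match(line)
        if m:
            events.append((current_ts, m.group(1), "reason " + m.group(2)))
            continue
        m = EVICTED.match(line)
        if m:
            events.append((current_ts, "ORA-29740", "evicted by member " + m.group(1)))
    return events
```

Usage would be something like `scan_lmon_trace(open(path))` for each LMON trace in bdump, then comparing the timestamps across both nodes.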
___________________________________________________________
ORA-1578
A nipple is the only intuitive interface, everything else needs a manual

reason 2

**chao_ping** · 12-22-2002, 12:27 PM

Hi, friend:
At the same time the next day, the other instance in the cluster died for the same reason: ORA-29740, again with reason 2.
The cluster had been running quite stably for the past month (the patchset was installed just about 30 days ago).
When I checked the Linux /var/log/messages, I found that syslogd restarted on both nodes at exactly the time the RAC instance died, on both days. Could there be some relation between them? Linux did not reboot; I checked the uptime value.
From the trace file, I found it says the dead instance failed to send its heartbeat.
First day, from the surviving instance rac1:

*** 2002-12-21 04:01:54.227
kjxgrnbrisalive: (1, 2) not beating, HB: 479418910, 479418910
*** 2002-12-21 04:01:54.239
kjxgrnbrdead: Detected death of 1, initiating reconfig
kjxgrrcfgchk: Initiating reconfig, reason 2
*** 2002-12-21 04:01:59.256
kjxgmrcfg: Reconfiguration started, reason 2
kjxgmcs: Setting state to 6 0.
*** 2002-12-21 04:01:59.258
Name Service frozen
kjxgmcs: Setting state to 6 1.

Second day, from the trace file of the surviving instance rac2:

*** 2002-12-22 04:01:56.457
kjxgrnbrisalive: (0, 1) not beating, HB: 479438832, 479438832
*** 2002-12-22 04:01:56.457
kjxgrnbrdead: Detected death of 0, initiating reconfig
kjxgrrcfgchk: Initiating reconfig, reason 2
*** 2002-12-22 04:02:01.486
kjxgmrcfg: Reconfiguration started, reason 2
kjxgmcs: Setting state to 9 0.
*** 2002-12-22 04:02:01.495
Name Service frozen
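
For comparing the two days, the "not beating" lines can be picked apart with a small Python 3 sketch like the one below. It is based only on the two lines quoted above. Two things are assumptions: that the two HB numbers are two readings of the same heartbeat counter (so equal values mean it did not advance), and that the first number in the parentheses is the member, which matches the "Detected death of N" line that follows in both traces. The second number in the parentheses is kept but not interpreted, and the function name is invented for illustration.

```python
import re

# Shape taken from the two kjxgrnbrisalive lines quoted in this thread.
HB_LINE = re.compile(
    r"kjxgrnbrisalive: \((\d+), (\d+)\) not beating, HB: (\d+), (\d+)"
)


def parse_missed_heartbeat(line):
    """Parse one 'not beating' line; return None if the line does not match."""
    m = HB_LINE.search(line)
    if m is None:
        return None
    member, second, hb_first, hb_second = (int(g) for g in m.groups())
    return {
        "member": member,        # assumed: the member later reported dead
        "second_field": second,  # meaning not confirmed in this thread
        "hb_first": hb_first,
        "hb_second": hb_second,
        # assumed: equal readings mean the counter did not advance
        "stalled": hb_first == hb_second,
    }
```

On both days the two HB values are identical, so under the assumption above the counter simply stopped advancing rather than jumping.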

**chao_ping** · 12-22-2002, 12:32 PM

Hi, friends:
I wonder if anyone here has experience dealing with RAC systems. What should I check to find out why the RAC instance failed to update the controlfile? I have already enabled this event:
event="29740 trace name errorstack level 3"
in one instance.
Should I set the undocumented parameter
_imr_active=false in the system?

**chao_ping** · 12-23-2002, 01:13 AM

I think I have found the answer.
There is a cron job, /etc/cron.daily/rdate, which syncs the system time with the time server.
It runs at exactly the time my RAC instances died.
I do not know whether the time on the dead instance was set backward or not, because I do not have the old value from before the time was synced with the time server.
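
To capture the missing "old value" the next time the sync runs, one option is to sample the wall clock alongside a clock that a time sync does not step, and look for jumps between them. This is only a sketch, assuming a Python 3 interpreter is available on the node; `sample_clocks` and `find_clock_steps` are names made up here, and the 0.5 second tolerance is an arbitrary choice.

```python
import time


def sample_clocks():
    """Return (wall_clock, monotonic) readings taken back to back."""
    return time.time(), time.monotonic()


def find_clock_steps(samples, tolerance=0.5):
    """Return the wall-clock jumps (in seconds) between consecutive samples.

    For each pair of samples, the elapsed wall-clock time is compared with
    the elapsed monotonic time. A negative result means the wall clock was
    set backward, a positive one that it was set forward.
    """
    steps = []
    for (wall0, mono0), (wall1, mono1) in zip(samples, samples[1:]):
        jump = (wall1 - wall0) - (mono1 - mono0)
        if abs(jump) > tolerance:
            steps.append(jump)
    return steps
```

Collecting `sample_clocks()` once a second from a few minutes before the cron window until a few minutes after, and then passing the list to `find_clock_steps`, would show whether rdate moved the clock and in which direction.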
