-
ASM diskgroup problems
O/S: RHEL 3
DB: 10.2.0.3
Hi all,
Our SAN storage suffered a zoning failure whilst the db was running, which caused ASM to panic. This is how the ASM storage looked before the problem occurred (the ASM dg is set up as normal redundancy):
Code:
GROUP_NUMBER DISK_NUMBER COMPOUND_INDEX INCARNATION MOUNT_S HEADER_STATU MODE_ST STATE REDUNDA LIBRARY TOTAL_MB FREE_MB NAME FAILGROUP LABEL PATH UDID PRODUCT CREATE_DA MOUNT_DAT REPAIR_TIMER READS WRITES READ_ERRS WRITE_ERRS READ_TIME WRITE_TIME BYTES_READ BYTES_WRITTEN
1 0 16777216 4042741803 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1048575 23255 DGGROUP_0000 PRIMARY /dev/raw/raw1 30-NOV-07 28-OCT-08 0 36804 46018 0 0 786.1 582.06 6755644416 6763955200
1 1 16777217 4042741804 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1048575 23370 DGGROUP_0001 PRIMARY /dev/raw/raw2 30-NOV-07 28-OCT-08 0 44119 42268 0 0 867.22 603.79 7411873280 5409548288
1 2 16777218 4042741805 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1048575 23271 DGGROUP_0002 PRIMARY /dev/raw/raw3 30-NOV-07 28-OCT-08 0 32283 45450 0 0 869.36 649.66 7197868032 6651507712
1 3 16777219 4042741809 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1048575 23398 DGGROUP_0003 PRIMARY /dev/raw/raw7 25-FEB-08 28-OCT-08 0 34146 45752 0 0 711.88 687.74 9106022400 6570381824
1 4 16777220 4042741810 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1048575 23272 DGGROUP_0004 FAILURE /dev/raw/raw8 25-FEB-08 28-OCT-08 0 50672 42625 0 0 420.74 627.37 6999826432 5928142336
1 5 16777221 4042741813 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1126399 21329 DGGROUP_0005 PRIMARY /dev/raw/raw11 26-APR-08 28-OCT-08 0 43892 54946 0 0 815.75 880.13 8084622848 8770541056
1 6 16777222 4042741806 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1048575 23342 DGGROUP_0006 FAILURE /dev/raw/raw4 04-JAN-08 28-OCT-08 0 31169 45148 0 0 856.28 703.76 6300344320 5618552832
1 7 16777223 4042741807 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1048575 23309 DGGROUP_0007 FAILURE /dev/raw/raw5 04-JAN-08 28-OCT-08 0 37962 47718 0 0 1243.8 784.62 8271555072 9542723584
1 8 16777224 4042741808 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1048575 23362 DGGROUP_0008 FAILURE /dev/raw/raw6 04-JAN-08 28-OCT-08 0 35696 47765 0 0 1063.55 810.58 7211213312 8064837120
1 9 16777225 4042741814 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1126399 21366 DGGROUP_0009 PRIMARY /dev/raw/raw12 26-APR-08 28-OCT-08 0 42290 52654 0 0 695.94 2057.64 8111098880 8949030912
1 10 16777226 4042741811 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1126399 21371 DGGROUP_0010 FAILURE /dev/raw/raw9 26-APR-08 28-OCT-08 0 57631 53387 0 0 973.89 947.48 8225914880 7936991232
1 11 16777227 4042741812 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1126399 21308 DGGROUP_0011 FAILURE /dev/raw/raw10 26-APR-08 28-OCT-08 0 45823 52156 0 0 926.72 2098.44 7714097152 7713911808
1 12 16777228 4042741815 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1048575 1034652 DGGROUP_0012 DGGROUP_0012 /dev/raw/raw13 28-OCT-08 28-OCT-08 0 2700 48486 0 0 159.19 469.68 918773760 1.0007E+10
1 13 16777229 4042741816 CACHED MEMBER ONLINE NORMAL UNKNOWN System 1048575 1034661 DGGROUP_0013 DGGROUP_0013 /dev/raw/raw14 28-OCT-08 28-OCT-08 0 2629 45224 0 0 72.31 1566.82 782823424 9782869504
14 rows selected.
SQL>
The above was taken whilst a rebalance was going on, as we had just added disks.
This is how the ASM diskgroup looks now:
Code:
GROUP_NUMBER DISK_NUMBER HEADER_STATU MOUNT_S PATH FAILGROUP
------------ ----------- ------------ ------- -------------- --------------------
0 0 MEMBER CLOSED /dev/raw/raw3
0 1 MEMBER CLOSED /dev/raw/raw7
0 2 MEMBER CLOSED /dev/raw/raw11
1 0 MEMBER CACHED /dev/raw/raw1 PRIMARY
1 1 MEMBER CACHED /dev/raw/raw2 PRIMARY
1 2 CANDIDATE MISSING should be raw3
1 6 MEMBER CACHED /dev/raw/raw4 FAILURE
1 7 MEMBER CACHED /dev/raw/raw5 FAILURE
1 8 MEMBER CACHED /dev/raw/raw6 FAILURE
1 3 CANDIDATE MISSING should be raw7
1 4 MEMBER CACHED /dev/raw/raw8 FAILURE
1 10 MEMBER CACHED /dev/raw/raw9 FAILURE
1 11 MEMBER CACHED /dev/raw/raw10 FAILURE
1 5 CANDIDATE MISSING should be raw11
1 9 MEMBER CACHED /dev/raw/raw12 PRIMARY
1 12 MEMBER CACHED /dev/raw/raw13 DGGROUP_0012
1 13 MEMBER CACHED /dev/raw/raw14 DGGROUP_0013
As can be seen, raw3, raw7 and raw11 have gone missing. Can anyone shed any light on how we get these disks back into the diskgroup? We tried:
1/ A repair; this succeeded but didn't touch the offline disks.
2/ Dropping the disks, but this resulted in an error as well, although it did start a rebalance, which we stopped as we don't have enough space.
Please can anyone let us know how we can get these disks back?
Thanks in advance,
Chucks
-
Hi
What are the ownership and permissions on
Code:
/dev/raw/raw3
/dev/raw/raw7
/dev/raw/raw11
Is there anything in the alert log of the ASM instance regarding this?
You can also use the kfed utility to find out which diskgroup the disks belong to:
Code:
make -f ins_rdbms.mk ikfed
kfed read devicename
kfed should be able to tell you which diskgroup the disks belonged to.
http://askdba.org/weblog/?p=104
regards
Hrishy
-
Hi Hrishy,
Thanks for that
We checked the ownerships and they are fine.
We used od -c to check the devices, and the output shows that those missing disks belong to the diskgroup, yet they are not part of the dg in ASM. We have raised this with Oracle. Let's see what they say!
-
Hi Chucks
What is the od -c command?
Please post the solution here when you get a response from Oracle.
Did you try running the kfed utility?
I'm also curious to know whether the rebalance from the operation before the SAN failure is still going on.
regards
Hrishy
-
Hi Hrishy,
od -c is similar to kfed, but I feel it gives more info:
Code:
$ od -c /dev/raw/raw3 | head
0000000 001 202 001 001 \0 \0 \0 \0 002 \0 \0 200 323 g ) 215
0000020 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
0000040 O R C L D I S K \0 \0 \0 \0 \0 \0 \0 \0
0000060 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
0000100 \0 \0 020 \n 002 \0 002 003 D G G R O U P _
0000120 0 0 0 2 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
0000140 \0 \0 \0 \0 \0 \0 \0 \0 D G G R O U P \0
0000160 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
0000200 \0 \0 \0 \0 \0 \0 \0 \0 P R I M A R Y \0
0000220 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
As can be seen above, the disk is recognised as an Oracle disk and is part of the DGGROUP diskgroup in the PRIMARY failgroup.
The previous rebalance finished on the 1st of November. As this dg is normal redundancy, I was wondering whether Oracle would use the mirrors (the FAILURE failgroup) if it can't find the PRIMARY failgroup?
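For anyone following along, one way to confirm that no rebalance is still in flight is to query V$ASM_OPERATION from the ASM instance (an empty result means nothing is running); a minimal sketch using the standard 10g view columns:
Code:

```sql
-- Any in-flight ASM operation; a rebalance shows OPERATION = 'REBAL'
SELECT group_number, operation, state, sofar, est_work, est_minutes
FROM   v$asm_operation;
```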
Thanks,
Chucks
-
Hi Chucks
I beg to differ: od is a Unix command and just gives you an octal dump of a raw device.
kfed is a C program from Oracle that reads an ASM disk header and shows its contents; see the link above for sample output.
Yes, in this case, since the PRIMARY failgroup is not available, ASM will start reading from the FAILURE failgroup.
regards
Hrishy
-
Hi Hrishy,
We have sorted the problem by adding the disks using the force clause:
Code:
ALTER DISKGROUP DGGROUP ADD FAILGROUP PRIMARY DISK
'/dev/raw/raw3' FORCE,'/dev/raw/raw7' FORCE,
'/dev/raw/raw11' FORCE;
There is a rebalance going on now, but we have our system back.
Regarding od -c: sorry, yes, I know it's a Unix command; I found it before I knew about kfed.
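To watch the rebalance kicked off by the forced ADD and to confirm the disks came back, something like the following should work (a sketch against the standard v$ views; the group number 1 is taken from the listings above):
Code:

```sql
-- Progress of the rebalance started by ADD ... FORCE
SELECT operation, state, power, sofar, est_work, est_minutes
FROM   v$asm_operation;

-- The re-added disks should show MOUNT_STATUS = 'CACHED' again
SELECT disk_number, path, mount_status, header_status, failgroup
FROM   v$asm_disk
WHERE  group_number = 1;
```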
Thanks again,
Chucks
-
Hi Chucks
You might want to investigate ASM in 11g.
I know for sure that it has a fast mirror resync (resilvering) option for exactly the case you outlined above: ASM tracks the extents that changed while the disks were missing and applies only those changes back once the disks are rediscovered, so there wouldn't be a full rebalance.
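In 11g this is driven by the DISK_REPAIR_TIME diskgroup attribute (it requires the COMPATIBLE.ASM attribute at 11.1 or higher); a sketch of how it would look for this diskgroup:
Code:

```sql
-- Let transiently-failed disks stay offline for 12 hours before
-- ASM drops them and forces a rebalance (the default is 3.6h)
ALTER DISKGROUP dggroup SET ATTRIBUTE 'disk_repair_time' = '12h';

-- After the SAN/zoning problem is fixed, bring the disks back online;
-- ASM resyncs only the extents that changed while they were offline
ALTER DISKGROUP dggroup ONLINE ALL;
```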
regards
Hrishy