RMAN - A "delete archivelog.." incorrectly got rid of an archivelog?!
One of my "delete archivelog all backed up 2 times to sbt" scripts suddenly misbehaved. It appears to have gotten rid of an archivelog file (1_3384.dbf) that had never been backed up!
Here's the background -
I've got 2 scripts scheduled via cron. One backs up archivelogs to tape. The other deletes archivelogs that have been backed up twice already. In this instance, the latter incorrectly got rid of an archivelog file that was NEVER backed up to tape!
Note: I messed up the scheduling, and both scripts ran at the SAME time (10am today). However, that doesn't explain why the latter script would get rid of an archivelog (1_3384.dbf) that had NEVER been backed up. The relevant archivelog 1_3384 was generated at 10am as well.
I've attached the relevant logfiles in a textfile.
1) Script 1 (DWDEV_periodic_arch_bkup.rman) is simply
run {
allocate channel t1 type 'sbt_tape' parms
allocate channel t2 type 'sbt_tape' parms
backup
tag='DWDEV periodic arch log bkup'
ARCHIVELOG all not backed up 2 times;
release channel t1;
release channel t2;
}
2) Script 2 (rm_bkedup_arch_log.rman) is
delete noprompt archivelog all backed up 2 times to sbt;
I've opened a TAR..it has been in an open status for the past hour and a half. Hasn't even got assigned to an analyst! I give up on them.
Update - Oracle recognizes this as a bug. They've not seen it before and need me to generate bucket loads of diagnostic information.
Just great! I get a warm and fuzzy feeling about my backups now.
And you've just become their beta tester.
I had 6 TARs in the last 4 weeks and had to do all sorts of things for them, spending my work time (yes, and the customer is paying for it) working for Oracle.
Low level peeps
I hear you. At least the TAR got assigned to a reasonably good analyst this time! He's a reviewer of some Oracle books, and right off the bat he admitted that it was a bug.
I was expecting them to close the TAR with an "unable to reproduce issue inhouse" message. But that wasn't the case. I was pleasantly surprised this time.
That said, you forgot to mention the database version and the OS. That information would be very helpful for anyone who runs into the same issue at their site.
Life is a journey, not a destination!
target - 22.214.171.124
target OS - SunOS 5.8
catalog - 126.96.36.199
catalog OS - AIX 5.1
I'm guessing the issue had something to do with the "timing" or scheduling of the 2 scripts. For now, as a workaround, I've incorporated both "backup archivelog" and "delete archivelog all backed up 2 times" into a SINGLE rman script (instead of splitting them into 2 separate jobs). That'll hopefully prevent it from occurring on production at least.
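For anyone who wants to try the same thing, a combined script along those lines might look roughly like the sketch below. It is pieced together from the two scripts posted earlier in the thread and is untested: the channel parms are deliberately left out (they were not shown above), and the comments are added for illustration. All it does is make sure the delete cannot start until the backup step has finished; whether that is enough depends on what actually caused the problem.

```
# Sketch only: backup and delete in ONE job, so the two steps can never overlap.
run {
  # parms omitted here; supply whatever your media manager requires
  allocate channel t1 type 'sbt_tape';
  allocate channel t2 type 'sbt_tape';
  backup
    tag='DWDEV periodic arch log bkup'
    archivelog all not backed up 2 times;
  release channel t1;
  release channel t2;
}
# Runs only after the run block above has completed.
delete noprompt archivelog all backed up 2 times to sbt;
```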
I've left things 'as is' (with rman debug) on DEV to simulate the failure again.
Btw, the debug feature is quite enlightening..for those that haven't tried it - give it a shot, and look at the tracefile generated. It makes interesting reading. You get to see the different rman procedures being called, along with the bind variables. I hadn't seen this before. It seems neat.
Adding a debug is typically done like this -
connect target /
connect catalog .....
allocate channel c1 type ...... debug=5
Last edited by Axr2; 08-10-2004 at 04:23 PM.
Yes, I have tried it in the past, and it was really good to see exactly what is going on behind the scenes.
An update :
Bug 3844804 has been filed. It is supposedly fixed in 10.2.
Per Oracle development :
"Thanks for providing a good debugging traces. We have found the problem.
We found that this problem can happen with DELETE command using option BACKED UP .. TIMES.
For eg. DELETE ARCHIVELOG ALL BACKED UP 1 TIMES TO DISK
can delete archivelog that aren't backed up when an archivelog is created during this command.
Use COMPLETED BEFORE option with BACKED UP option. For eg.
DELETE ARCHIVELOG ALL BACKED UP 1 TIMES TO DISK COMPLETED BEFORE 'SYSDATE-1';
This will make sure that RMAN doesn't delete the archivelog that was created during past one day.
Bug is fixed in Oracle 10g (10.2)"
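Applied to the delete script from the first post, the suggested workaround would presumably look something like the line below. This is only a sketch: it assumes the COMPLETED BEFORE clause can be appended to the "to sbt" form in the same position as in development's "TO DISK" example, the one-day window is simply the value from their example, and it has not been tested here.

```
# Sketch only: skip any archivelog completed within the last day,
# so a log created while the command is running is never a delete candidate.
delete noprompt archivelog all backed up 2 times to sbt completed before 'sysdate-1';
```

The price is that logs stay on disk for at least a day even after they have been backed up twice, so the archive destination needs room for that.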
"As for the workaround - has the proposed solution actually been "tested"? I know what the new command/syntax means. But has it been tested? Did development prepare a test case that caused the original command to fail and the new command to work as expected? Or did they figure it out intuitively from the debug files? If so, I'd like to see the piece of code that helped them nail the issue. I just want to make sure that the proposed solution actually works. I don't want to be in a situation where after a couple of months, I run into the same thing again (despite the new rman command)."
ORACLE SUPPORT :
Neither Development nor I were able to create a reproducible test case. The problem was found from the debug information you provided.
As for the 'piece of code' they found to be the issue, this is Oracle proprietary information and can not be shared.