Datafile recovery scenario

**waitecj** · 10-20-2006, 10:31 AM

I am testing some RMAN backup and restore scenarios and have found something that I can't quite work out. The one I am having trouble with a scenario whereby a Unix Admin accidently renames the wrong raw volume. I have created a database on raw volumes. Each datafile points to its corresponding raw volume via a symbolic link (e.g. waitec_tbs001.dbf > /dev/vx/rdsk/tclora_dg/TPRCJW_waitec_tbs001). I have then created a table called TEST into tablespace WAITEC_TBS and inserted a couple of rows:

Code:

SQL*Plus: Release 9.2.0.7.0 - Production on Fri Oct 20 14:36:31 2006

Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.


Connected to:
Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production

SQL> create table test(id number) tablespace waitec_tbs;

Table created.

SQL> insert into test values(1);

1 row created.

SQL> insert into test values(2);

1 row created.

SQL> commit;

Commit complete.

SQL> select * from test;

        ID
----------
         1
         2

SQL> exit
Disconnected from Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production

I have then asked my Unix Admin to rename the raw volume from TPRCJW_waitec_tbs001 to TPRCJW_waitec_tbs002. Remember, the symbolic link still points to TPRCJW_waitec_tbs001. From here I expect that I will be able to insert more rows as they will be inserted in memory:

Code:

tclwmoc2-oracle> sqlplus /

SQL*Plus: Release 9.2.0.7.0 - Production on Fri Oct 20 14:41:10 2006

Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.


Connected to:
Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production

SQL> select * from test;

        ID
----------
         1
         2

SQL> insert into test values(3);

1 row created.

SQL> commit;

Commit complete.

SQL> select * from test;

        ID
----------
         1
         2
         3


SQL> select file_name, status from dba_data_files;

FILE_NAME
--------------------------------------------------------------------------------
STATUS
---------
/global/archtclora1_dg/u02/oradata/TPRCJW/system001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/undo1001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/undo2001.dbf
AVAILABLE


FILE_NAME
--------------------------------------------------------------------------------
STATUS
---------
/global/archtclora1_dg/u02/oradata/TPRCJW/users001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/tools001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/waitec_tbs001.dbf
AVAILABLE


6 rows selected.

SQL> select * from v$recover_file;

no rows selected


SQL> select file#, status from v$datafile;

     FILE# STATUS
---------- -------
         1 SYSTEM
         2 ONLINE
         3 ONLINE
         4 ONLINE
         5 ONLINE
         6 ONLINE

6 rows selected.

So far so good. Now, I expect that if I issue an "ALTER SYSTEM CHECKPOINT" (to flush dirty blocks and write the latest SCN in the datafile headers), Oracle should recognize that it can't find the datafile for my tablespace (WAITEC_TBS) and should then mark the datafile (number 6) as needing recovery:

Code:

-- Four redo log groups
SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system checkpoint;

System altered.

SQL> select file#, status from v$datafile;

     FILE# STATUS
---------- -------
         1 SYSTEM
         2 ONLINE
         3 ONLINE
         4 ONLINE
         5 ONLINE
         6 ONLINE

6 rows selected.

SQL> select file_name, status from dba_data_files;

FILE_NAME
--------------------------------------------------------------------------------
STATUS
---------
/global/archtclora1_dg/u02/oradata/TPRCJW/system001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/undo1001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/undo2001.dbf
AVAILABLE


FILE_NAME
--------------------------------------------------------------------------------
STATUS
---------
/global/archtclora1_dg/u02/oradata/TPRCJW/users001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/tools001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/waitec_tbs001.dbf
AVAILABLE


6 rows selected.

SQL> select * from v$recover_file;

no rows selected

SQL> insert into test values(4);

1 row created.

SQL> commit;

Commit complete.

SQL> select * from test;

        ID
----------
         1
         2
         3
         4

SQL> alter system checkpoint;


System altered.

SQL> SQL> 

SQL> select * from v$recover_file;

no rows selected

SQL> select file_name, status from dba_data_files;

FILE_NAME
--------------------------------------------------------------------------------
STATUS
---------
/global/archtclora1_dg/u02/oradata/TPRCJW/system001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/undo1001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/undo2001.dbf
AVAILABLE


FILE_NAME
--------------------------------------------------------------------------------
STATUS
---------
/global/archtclora1_dg/u02/oradata/TPRCJW/users001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/tools001.dbf
AVAILABLE

/global/archtclora1_dg/u02/oradata/TPRCJW/waitec_tbs001.dbf
AVAILABLE

Nothing doing as you can see. Now, of course as soon as I try to change the state of the tablespace (by running begin backup or even a shutdown startup) it will complain:

Code:

SQL> alter tablespace waitec_tbs begin backup;
alter tablespace waitec_tbs begin backup
*
ERROR at line 1:
ORA-01116: error in opening database file 6
ORA-01110: data file 6:
'/global/archtclora1_dg/u02/oradata/TPRCJW/waitec_tbs001.dbf'
ORA-27041: unable to open file
SVR4 Error: 2: No such file or directory
Additional information: 3

What I don't understand is, in a real life scenario where you perhaps lose connection to a datafile for whatever reason, you'd think that you'd want to know about it before you start a backup or shutdown the database (we only do that once in a blue moon).
So my question is WHY? Surely the datafile should be marked as UNAVAILABLE or something?...
I assume that my test is fundamentally flawed in some way, and i'd be happy if you could point that out (if thats the case).

Obviously, this question isn't RMAN related, or even recovery related as such (I haven't even go that far yet!!), but before you even connect to RMAN, you'd think that you'd need to have a reason to in the first place!! In other words, if the DBA doesn't think anything is wrong, why would he be doing anything?!

If it makes a difference, we are running Solaris 8 and Veritas Volume Manager (but then you could probably see that from the naming convention...)

**ebrian** · 10-20-2006, 12:28 PM

Even though you "removed" the file, the process writing to the file still has a pointer to it via it's inode number and therefore has it open. Essentially, you removed it's directory entires and therefore, new processes would not be able to write to the file. The same phenomena can be experienced by deleting a control file.

**waitecj** · 10-20-2006, 01:01 PM

Originally Posted by ebrian

Even though you "removed" the file, the process writing to the file still has a pointer to it via it's inode number and therefore has it open. Essentially, you removed it's directory entires and therefore, new processes would not be able to write to the file. The same phenomena can be experienced by deleting a control file.

Thanks for the reply.

When you say new process?... I am logging out and logging back in again (so have spawned a new server process), and I am still able to (or at least appear to) write to that file. Is that not what you mean't?

If this is one of those "working as designed" phenomena, how would you get around it in a real world scenario if a few weeks down the line to take a datafile offline only to find that Oracle can't find it? Again, I apologise if I have misinterpreted you.

**jhmartin** · 10-23-2006, 02:57 PM

Your new process is just a fork of an existing process, so it brought its filehandles along with it.

**ixion** · 10-23-2006, 06:10 PM

DBWR opened the file for writing. Inode handle is still being used since the sys admin merely renamed! So technically the file is still available.

Loging off and on, merely creates background servers which only read the Data file when a block is not first found in the Buffer Cache.

Have your sysadmin take that raw device offline...

Bigger question, why use symbolic links to raw devices? If you are running RAC. I'd think you'd be better off using just the RAW device name. That way its automatically available to both nodes when the tablespace is created. You don't have to remember to go and create the softlink on the other nodes....

**jhmartin** · 10-23-2006, 06:45 PM

Oh, and Oracle doesn't wait for a checkpoint to update the datafiles. It may start writing the changes to the datafile and redo immediately (after creating the necessary undo), with the expectation that an instance failure will cause the change to be rolled back. A checkpoint just forces the buffers to be written to disk immediately.

otn.oracle.com indicates that a tablespace is automatically taken offline when a DBWn encounters repeated failures when writing to a datafile.

**waitecj** · 10-25-2006, 06:26 AM

Originally Posted by ixion

DBWR opened the file for writing. Inode handle is still being used since the sys admin merely renamed! So technically the file is still available.

Loging off and on, merely creates background servers which only read the Data file when a block is not first found in the Buffer Cache.

Have your sysadmin take that raw device offline...

Bigger question, why use symbolic links to raw devices? If you are running RAC. I'd think you'd be better off using just the RAW device name. That way its automatically available to both nodes when the tablespace is created. You don't have to remember to go and create the softlink on the other nodes....

starting to make some sense I guess, but then does it fail when I try to issue a "begin backup" command. I tried to get the sys admin to take the raw device offline, but it didn't work as the file was still held by Oracle (as you'd expect).

We use symbolic links for pure convenience. We have many disk groups, but put all our symbolic links regardless in the same place so even if we decided to move the raw volume or whatever, the database wouldn't have to chnage (only the symbolic link).

**waitecj** · 10-25-2006, 06:27 AM

Originally Posted by jhmartin

Oh, and Oracle doesn't wait for a checkpoint to update the datafiles. It may start writing the changes to the datafile and redo immediately (after creating the necessary undo), with the expectation that an instance failure will cause the change to be rolled back. A checkpoint just forces the buffers to be written to disk immediately.

otn.oracle.com indicates that a tablespace is automatically taken offline when a DBWn encounters repeated failures when writing to a datafile.

I know that much mate, I was just trying to illustrate my point by issuing the command. Clarity in questions breed meaningful answers.

**ixion** · 10-25-2006, 12:22 PM

Originally Posted by waitecj

starting to make some sense I guess, but then does it fail when I try to issue a "begin backup" command. I tried to get the sys admin to take the raw device offline, but it didn't work as the file was still held by Oracle (as you'd expect)..

Hmm, he should be able to force it offline.

Originally Posted by waitecj

We use symbolic links for pure convenience. We have many disk groups, but put all our symbolic links regardless in the same place so even if we decided to move the raw volume or whatever, the database wouldn't have to chnage (only the symbolic link).

No more work would be needed to just use the raw device name. Just would need to do a database file rename if the raw device is relocated. instead of having to do recreate of the softlink. IMHO, this is better...

**yxez** · 10-25-2006, 07:52 PM

You guys are impressive...you did your homewrok well...
I still don't understand though whats the fuzz

Just be sure to test your backup by restoring then on a test
server.

Thread: Datafile recovery scenario

Thread Tools

Display

Datafile recovery scenario

Posting Permissions