Non-Oracle failover software solution - advice needed
Our sysadmin is trying to implement a hardware failover solution for an Oracle DB server. Because of cost he's plumped for something called 'Double Take' by NSI Software - sold to us by a company called Sunbelt.
No - I was not consulted during the purchase phase, but now that the decision has been made I'm trying to see if the software will actually work.
Basically we have 2 servers, LIVE1 and FAIL2. The DB exists on both but is shut down on FAIL2.
They've done some limited testing and on both occasions there has been a corruption - the worst being a controlfile corruption that stopped the DB opening on FAIL2 after a failover.
I’ve been reading the Sunbelt documentation, and the software copies data O.S. block by O.S. block as it detects changes (it uses a checksum to detect them). This means it only ever copies what has changed on disk. So it's not an Oracle solution at all - which got my alarm bells ringing.
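To show what I understand by "block by block with a checksum", here is a rough Python sketch of the general technique. It is only my illustration, not Double Take's actual code: the 4096-byte block size, the SHA-256 hash and the names block_checksums and changed_blocks are all my own assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, purely for illustration


def block_checksums(data, block_size=BLOCK_SIZE):
    """Return one checksum per fixed-size block of a file image."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Return the indexes of blocks in `new` that differ from `old`.

    Blocks that exist only in `new` (the file grew) count as changed.
    """
    old_sums = block_checksums(old, block_size)
    new_sums = block_checksums(new, block_size)
    return [
        idx
        for idx, checksum in enumerate(new_sums)
        if idx >= len(old_sums) or old_sums[idx] != checksum
    ]
```

The point is that a tool working this way sees only bytes on disk. It has no idea what a datafile, redo log or controlfile is, or whether the set of blocks it has copied so far is consistent from Oracle's point of view.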
If we are using this solution to fail over when we detect a hardware failure (i.e. the server LIVE1 cannot be reached by the rest of the network), then at the point of failure there HAS to be information in the buffer cache that has yet to be written to disk. I’m not talking about application data, as all committed work MUST be in datafiles or in redo log files on disk.
When the network fails, the source DB on LIVE1 is still open, and all we have copied to the target database FAIL2 are blocks of changed data, written to the datafiles of an Oracle database that is not up and running yet.
When we decide to fail over, the source LIVE1 database is isolated from the rest of the network to prevent an IP conflict (as FAIL2 is now to assume LIVE1's IP address). The DB on LIVE1 may - or may not - then be shut down. Note that the DB itself may not have failed at all: this solution is aimed at addressing hardware failure (network card, cabling …).
So – whatever is in buffers of the DB on LIVE1 at the point of failure remains in the buffer cache of the source database. It has to. We’ve isolated that machine.
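Here is a toy Python model of what I mean. It is a deliberate simplification and not how Oracle really works internally: ToyDatabase, checkpoint and disk_snapshot are names I made up, and I'm assuming a commit just writes a record to an on-disk redo log while changed blocks sit in memory until a checkpoint.

```python
import copy


class ToyDatabase:
    """Toy model: a buffer cache in memory, plus a datafile and redo log on disk."""

    def __init__(self):
        self.buffer_cache = {}  # changed blocks held in memory only
        self.datafile = {}      # blocks that have reached disk
        self.redo_log = []      # change records that have reached disk

    def update(self, block_id, value, commit=False):
        """Change a block in memory; a commit also writes redo to disk."""
        self.buffer_cache[block_id] = value
        if commit:
            self.redo_log.append((block_id, value))

    def checkpoint(self):
        """Flush every buffered block to the datafile on disk."""
        self.datafile.update(self.buffer_cache)
        self.buffer_cache.clear()


def disk_snapshot(db):
    """Everything a disk-level replicator could possibly have copied."""
    return {
        "datafile": copy.deepcopy(db.datafile),
        "redo_log": copy.deepcopy(db.redo_log),
    }
```

If LIVE1 is cut off after some updates but before the next checkpoint, the snapshot on FAIL2 holds the older datafile blocks plus whatever redo reached disk. Anything that only lived in buffer_cache is simply not there.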
The target database FAIL2 is then started up, and there is immediately a conflict between what is written in the controlfile (that the instance is already open) and what the database is trying to do (open the instance).
Well, this is my theory anyway. Does it hold water?