Non Oracle failover software solution - Advice needed
DBAsupport.com Forums - Powered by vBulletin

  1. #1
    Join Date
    Jan 2000
    Location
    Chester, England.
    Posts
    818

    Non Oracle failover software solution - Advice needed

    Our sysadmin is trying to implement a hardware failover solution for an Oracle DB server. Because of cost he's plumped for something called 'Double Take' by NSI Software - sold on to us by a company called Sunbelt.

    No - I was not consulted during the purchase phase but now the decision has been made I'm trying to see if the software will actually work.

    Basically we have 2 servers, LIVE1 and FAIL2. The DB exists on both but is shut down on FAIL2.

    They've done some limited testing, and on both occasions there's been a corruption - the worst being a controlfile corruption that stopped the DB opening on FAIL2 after a failover.

    I've been reading the Sunbelt documentation, and the software copies data, O.S. block by O.S. block, as it detects changes (it uses CHECKSUM to detect these changes). This means it only ever copies what has changed on disk. So it's not an Oracle solution at all - which got my alarm bells ringing.
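
    To make the "only copies what changed on disk" point concrete, here is a minimal sketch of checksum-based block replication. It is an illustration only: the function names, the tiny block size and the use of CRC32 are assumptions, not how Double-Take is actually implemented. The thing to notice is that the replicator only ever sees bytes that have reached disk; anything still sitting in an instance's memory is invisible to it.

```python
import zlib

BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to read


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def changed_blocks(source: bytes, target: bytes, size: int = BLOCK_SIZE) -> list[int]:
    """Return the indexes of source blocks whose checksum differs from the target's."""
    src = split_blocks(source, size)
    tgt = split_blocks(target, size)
    changed = []
    for i, block in enumerate(src):
        # A block counts as changed if the target lacks it or the checksums differ.
        if i >= len(tgt) or zlib.crc32(block) != zlib.crc32(tgt[i]):
            changed.append(i)
    return changed


def replicate(source: bytes, target: bytes, size: int = BLOCK_SIZE) -> bytes:
    """Copy only the changed blocks from source onto target and return the result."""
    src = split_blocks(source, size)
    tgt = split_blocks(target, size)
    for i in changed_blocks(source, target, size):
        if i < len(tgt):
            tgt[i] = src[i]
        else:
            tgt.append(src[i])
    # Drop any trailing target blocks the source no longer has.
    return b"".join(tgt[:len(src)])


if __name__ == "__main__":
    on_disk_live1 = b"AAAABBBBCCCC"
    on_disk_fail2 = b"AAAAXXXXCCCC"
    print(changed_blocks(on_disk_live1, on_disk_fail2))
    print(replicate(on_disk_live1, on_disk_fail2) == on_disk_live1)
```

    Here `on_disk_live1` and `on_disk_fail2` are stand-ins for the same file on the two servers; only the middle block differs, so only that block is copied.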

    If we are using this solution to fail over when we detect a hardware failure (i.e. the server LIVE1 cannot be reached by the rest of the network), then at the point of failure there HAS to be information in the buffer cache that has yet to be written to disk. I'm not talking about application data, as all committed work MUST be in datafiles or in redo log files on disk.

    When the network fails, the source DB on LIVE1 is still open and all we have copied to the target database FAIL2 is blocks of changed data to the datafiles of an Oracle database that is not up and running yet.

    When we decide to fail over, the source LIVE1 database is isolated from the rest of the network to prevent an IP conflict (as FAIL2 is now to assume LIVE1's IP address). Then the DB on LIVE1 may - or may not - be shut down. (Note that the DB itself may not have failed at all; this solution is aimed at addressing hardware failure (network card, cabling).)

    So whatever is in the buffers of the DB on LIVE1 at the point of failure remains in the buffer cache of the source database. It has to. We've isolated that machine.

    The target database FAIL2 is then started up, and there is immediately a conflict between what is written in the controlfile (that the instance is already open) and what the database is trying to do (open the instance).

    Well, this is my theory anyway. Does it hold water?
    Last edited by JMac; 05-19-2006 at 07:12 AM.

  2. #2
    Join Date
    Sep 2003
    Location
    over the hill and through the woods
    Posts
    995
    Hey Jmac, long time no see

    Looks like you've got yourself into a real pickle here.

    I can't offer any advice, other than that I'm curious... if you find out that it's this 3rd party's problem, any possibility you can tell them to sod off and stick with Oracle?
    Oracle it's not just a database it's a lifestyle!
    --------------
    BTW....You need to get a girlfriend who's last name isn't .jpg

  3. #3
    Join Date
    Jan 2000
    Location
    Chester, England.
    Posts
    818
    Hello Doc. Been busy - big changes here. Takeover - new data centres ... busy boy. Anyway - I'm back now.

    If I can prove that the software can't cope with transferring the data in the buffer cache then it'll kill this project (which is good!) and we can move onto RAC or whatever. But I have to be sure.

    As it's a cheap option, management want it to work; as a DBA I want something reliable and supported by Oracle (which Double-Take is not).

  4. #4
    Join Date
    Sep 2003
    Location
    over the hill and through the woods
    Posts
    995
    Ahhh, money, it's the root of all management evil...

    Tell them you'll save them a ton of money and just go with Data Guard. No expensive RAC. Hell, you've already got the server, just go slap another database on there and put it in hot standby. Problem solved.

    Here's some fodder for you on why you don't need RAC

  5. #5
    Join Date
    Jul 2002
    Posts
    335
    Hi, I really don't see how this is going to work. You're right, you'll have stuff in memory that won't be on disk etc. Really not surprised you've hit corruption issues.

    As an alternative, can you not set up a physical standby and use Data Guard? Then configure it in maximum performance or maximum availability mode, depending on your requirements? That way you'll know you can fail over quickly with no data corruption.

    Bazza

  6. #6
    Join Date
    Jul 2002
    Posts
    335
    Ah, looks like I just repeated what the Doc says...

  7. #7
    Join Date
    Sep 2002
    Location
    England
    Posts
    7,333
    In principle this should work; whether the product is working is a different matter.

    We do the same thing with SAN replication: the SAN copies itself bit by bit to another SAN in the DR site, and if site1 fails we just start the database on the other site. It works, and works well. At the same time the link between the SANs is broken.

    Do you possibly have a case where the file is still being written to when it is opened by the second instance?

    Worth pointing out again that ALL committed transactions are on disk somewhere; what is in memory doesn't really matter.

    So the principle is sound, but the product sounds dodgy.
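
    As a rough illustration of why losing the buffer cache doesn't lose committed work, here is a toy sketch of recovery from a redo log. It is heavily simplified and all the names are made up for the example: real Oracle crash recovery rolls forward all redo and then rolls back uncommitted transactions using undo, whereas this toy just skips changes from transactions that never committed.

```python
def crash_recover(datafile: dict, redo_log: list) -> dict:
    """Rebuild a consistent state from on-disk data plus the on-disk redo log.

    datafile : the (possibly stale) values that had reached disk
    redo_log : records of the form (txn_id, "SET", key, value) or (txn_id, "COMMIT")
    """
    recovered = dict(datafile)
    # First pass: find which transactions actually committed.
    committed = {rec[0] for rec in redo_log if rec[1] == "COMMIT"}
    # Second pass: replay changes in log order, committed transactions only.
    for rec in redo_log:
        if rec[1] == "SET" and rec[0] in committed:
            _, _, key, value = rec
            recovered[key] = value
    return recovered


if __name__ == "__main__":
    # The datafile on disk still holds the old value: the changed block was
    # only in the buffer cache when the server dropped off the network.
    stale_datafile = {"balance": 100}
    redo_on_disk = [
        (1, "SET", "balance", 150),
        (1, "COMMIT"),
        (2, "SET", "balance", 999),  # transaction 2 never committed
    ]
    print(crash_recover(stale_datafile, redo_on_disk))
```

    The committed change from transaction 1 is recovered from the redo log even though the datafile was stale, and the uncommitted change from transaction 2 is discarded. This only works if the replicated redo logs, datafiles and controlfiles are consistent with each other, which is the part that depends on the replication product.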

  8. #8
    Join Date
    Jan 2000
    Location
    Chester, England.
    Posts
    818
    Hi Davey.

    Maybe the product is dodgy, but I still think my theory holds water ... no?

    FAIL2 - the target failover DB - is always shut down on the server whilst the data is being replicated to it. That includes any changes to the controlfiles that are flushed to disk from the buffers.

    What if the replication stops - because the source server falls out of the network - before the controlfile(s) can be updated with whatever is in the buffers? It's not data, so COMMITs etc. don't come into it.

    The startup on FAIL2 after a failover uses whatever datafiles were copied over or updated from the source system. So the state of the physical controlfile(s) on disk might not be consistent with a database that's shut and is about to open (as they were last updated whilst the source DB was open).

    Here's a section of the TAR notes I have from Support - they missed the point a bit but kind of support my thinking:

    ISSUE CLARIFICATION
    ====================

    + Platform : Windows 2003 server.
    + DB Server software 9.2.0.6
    + Are attempting to implement a hardware failover, high-availability solution for a production system using
    'Sunbelt Software Double Take'.
    + SQL attempted is the MOUNT statement at at startup.
    ksedmp: internal or fatal error
    ORA-00600: internal error code, arguments: [kccsbck_first], [1], [1943107820], [], [], [], [], []
    Current SQL statement for this session:
    ALTER DATABASE MOUNT

    UPDATE
    =======
    1/ As Oracle support doesn't give support on 'Sunbelt Software Double Take' , Please could you clarify the issue
    as it seems you are using a High availability system environment without using our HA system named RAC
    ( Real Application cluster ) ?

    2/ I have found the following article 243549.1 ( ONLY USE IN A RAC ENV.)
    this issue could be related as the main problem sound like a NIC issue and Microsoft windows.
    Comments :
    ---------------------
    The Windows OS disables the internal NIC after experiencing the communications failure on
    the interconnect, causing all the locally bound socket to fail. This causes the clusterware (CM Service) to exist in a dead state
    although the database instance is actually up and running.

    Please could you check this article , there is a link to microsoft support web site.


    Thank you,
    Regards.
    PAUL GAMEIRO.

    DATA COLLECTED
    ===============
    ALERT LOG
    -----------
    The alert_vald33.log file shows:

    Wed Apr 19 17:25:28 2006
    Current log# 1 seq# 2560 mem# 0: E:\ORADATA\PMX33\VALD33\REDO01AVALD33.LOG
    Current log# 1 seq# 2560 mem# 1: F:\ORADATA\PMX33\VALD33\REDO01BVALD33.LOG
    Current log# 1 seq# 2560 mem# 2: G:\ORADATA\PMX33\VALD33\REDO01CVALD33.LOG
    Wed Apr 19 17:25:28 2006
    ARC0: Beginning to archive log 3 thread 1 sequence 2559
    Creating archive destination LOG_ARCHIVE_DEST_1: 'J:\ORADATA\PMX33\VALD33\ARCHIVE\VALD332559.ARCLOG'
    ARC0: Completed archiving log 3 thread 1 sequence 2559
    Dump file e:\oradata\pmx33\vald33\admin\bdump\alert_vald33.log
    Wed Apr 19 17:29:26 2006
    ORACLE V9.2.0.6.0 - Production vsnsta=0
    vsnsql=12 vsnxtr=3
    Windows 2000 Version 5.2 Service Pack 1, CPU type 586
    Wed Apr 19 17:29:26 2006
    Starting ORACLE instance (normal)
    Wed Apr 19 17:29:26 2006
    Running with 1 strand for Non-Enterprise Edition
    LICENSE_MAX_SESSION = 0
    LICENSE_SESSIONS_WARNING = 0
    SCN scheme 2
    Running with 1 strand for Non-Enterprise Edition
    LICENSE_MAX_USERS = 0
    SYS auditing is disabled
    Starting up ORACLE RDBMS Version: 9.2.0.6.0.
    System parameters with non-default values:
    processes = 500
    timed_statistics = TRUE
    shared_pool_size = 159383552
    sga_max_size = 1074341756
    java_pool_size = 159383552
    control_files = E:\oradata\pmx33\VALD33\controlAVALD33.ctl, F:\oradata\pmx33\VALD33\controlBVALD33.ctl, G:\oradata\pmx33\VALD33\controlCVALD33.ctl
    db_block_size = 8192
    db_cache_size = 67108864
    compatible = 9.2.0
    log_archive_start = TRUE
    log_archive_dest = J:\oradata\pmx33\VALD33\archive
    log_archive_format = VALD33%s.arclog
    log_buffer = 8192
    log_checkpoint_interval = 40000
    db_files = 20
    db_file_multiblock_read_count= 8
    fast_start_mttr_target = 0
    dml_locks = 100
    undo_management = AUTO
    undo_tablespace = UNDOTBS
    remote_login_passwordfile= EXCLUSIVE
    db_domain =
    instance_name = VALD33
    background_dump_dest = E:\oradata\pmx33\VALD33\admin\bdump
    user_dump_dest = E:\oradata\pmx33\VALD33\admin\udump
    max_dump_file_size = 10240
    core_dump_dest = E:\oradata\pmx33\VALD33\admin\cdump
    sort_area_size = 524288
    db_name = valid33
    open_cursors = 300
    PMON started with pid=2
    DBW0 started with pid=3
    LGWR started with pid=4
    CKPT started with pid=5
    SMON started with pid=6
    RECO started with pid=7
    Wed Apr 19 17:29:29 2006
    ARCH: STARTING ARCH PROCESSES
    ARC0 started with pid=8
    ARC0: Archival started
    ARC1 started with pid=9
    Wed Apr 19 17:29:30 2006
    ARCH: STARTING ARCH PROCESSES COMPLETE
    Wed Apr 19 17:29:30 2006
    Oracle Data Guard is not available in this edition of Oracle.
    Wed Apr 19 17:29:30 2006
    ARC0: Thread not mounted
    Wed Apr 19 17:29:30 2006
    alter database mount exclusive
    Wed Apr 19 17:29:31 2006
    ARC1: Archival started
    Wed Apr 19 17:29:31 2006
    ARC1: Thread not mounted
    Wed Apr 19 17:29:35 2006
    Errors in file e:\oradata\pmx33\vald33\admin\udump\vald33_ora_5600.trc:
    ORA-00600: internal error code, arguments: [kccsbck_first], [1], [1943107820], [], [], [], [], []
    Wed Apr 19 17:29:37 2006
    ORA-600 signalled during: alter database mount exclusive...
    Starting ORACLE instance (normal)
    Shutting down instance: further logons disabled
    Shutting down instance (immediate)
    ...


    TRACE FILE
    ------------
    The vald33_ora_636.trc trace file shows:

    *** SESSION ID:(9.1) 2006-04-19 17:31:24.961
    *** 2006-04-19 17:31:24.961
    ksedmp: internal or fatal error
    ORA-00600: internal error code, arguments: [kccsbck_first], [1], [1943107820], [], [], [], [], []
    Current SQL statement for this session:
    ALTER DATABASE MOUNT
    ----- Call Stack Trace -----
    ksedmp ksfdmp kgerinv kgesinv ksesin kccsbck kccocf kcfcmb kcfmdb adbdrv opiexe opiosq0 kpooprx kpoal8 opiodr ttcpip opitsk opiino opiodr opidrv sou2o opimai OracleThreadStart@4
    ======

    The process state dump shows:

    Process global information:
    process: 45E617EC, call: 46077C5C, xact: 468BAB74, curses: 45EE23B4, usrses: 45EE23B4
    ----------------------------------------
    SO: 45E617EC, type: 2, owner: 00000000, flag: INIT/-/-/0x00
    (process) Oracle pid=10, calls cur/top: 46077C5C/46077C5C, flag: (0) -
    int error: 0, call error: 0, sess error: 0, txn error 0
    ...
    SO: 45EE23B4, type: 4, owner: 45E617EC, flag: INIT/-/-/0x00
    (session) trans: 468BAB74, creator: 45E617EC, flag: (41) USR/- BSY/-/-/-/-/-
    DID: 0000-000A-00000008, short-term DID: 0000-0000-00000000
    txn branch: 00000000
    oct: 35, prv: 0, sql: 47FDCD00, psql: 47FDCD00, user: 0/SYS
    O/S info: user: ad, term: CA1971, ospid: 4900:4232, machine: AEL\CA1971
    program: sqlplus.exe
    application name: sqlplus.exe, hash value=0
    last wait for 'control file sequential read' blocking sess=0x0 seq=29 wait_time=42808
    file#=2, block#=3, blocks=1
    temporary object counter: 0
    ...
    SO: 46077C5C, type: 3, owner: 45E617EC, flag: INIT/-/-/0x00
    (call) sess: cur 45ee23b4, rec 45ee2d24, usr 45ee23b4; depth: 0
    ----------------------------------------
    SO: 460E0510, type: 6, owner: 46077C5C, flag: INIT/-/-/0x00
    (enqueue) CF-00000000-00000004 DID: 0000-000A-00000008
    lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    res: 4615b960, mode: S, prv: 4615b968, sess: 45ee23b4, proc: 45e617ec
    ----------------------------------------
    SO: 460E04C4, type: 6, owner: 46077C5C, flag: INIT/-/-/0x00
    (enqueue) CF-00000000-00000000 DID: 0000-000A-00000007
    lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    res: 4615b90c, mode: X, prv: 4615b914, sess: 45ee23b4, proc: 45e617ec
    ----------------------------------------
    SO: 460E0478, type: 6, owner: 46077C5C, flag: INIT/-/-/0x00
    (enqueue) IS-00000000-00000000 DID: 0000-000A-00000004
    lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    res: 4615b864, mode: X, prv: 4615b86c, sess: 45ee23b4, proc: 45e617ec
    ----------------------------------------
    SO: 45EE2D24, type: 4, owner: 46077C5C, flag: INIT/-/-/0x00
    (session) trans: 00000000, creator: 00000000, flag: (2) -/REC -/-/-/-/-/-
    DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
    txn branch: 00000000
    oct: 0, prv: 0, sql: 00000000, psql: 00000000, user: 0/SYS
    temporary object counter: 0
    ===================================================


    RESEARCH
    =========
    (Note: This is INTERNAL ONLY research. No action should be taken by the customer on this information.
    This is research only, and may NOT be applicable to your specific situation.)

    The ORA-600 [kccsbck_first] error here is described in Note:139013.1 as:

    We receive this error because we are attempting to be the first thread/instance to mount the database and cannot because it appears that
    at least one other thread has mounted the database already.

    We therefore abort the mount attempt and log this error.

    This could certainly be due to the Microsoft network configuration setting as described in Note:243549.1 due to
    Microsoft bug Q239924. Basically, as the instance restarts successfully on the original server, it looks like not all the data has been flushed to the controlfiles before they are copied to the new server, hence causing the ORA-600 [kccsbck_first] as we get a different mount Id and find the controlfiles already marked as mounted by the previous instance.



    21-APR-06 08:16:49 GMT

    UPDATE
    =======
    Called +44 1244 845700.
    Left a message for John advising him of the above updates. Informed him that the error here is occurring because the original instance is not fully shutdown correctly with ALL buffers flushed to disk, and so when the controlfiles are copied, they still show the instance mounted, which then conflicts when trying to mount the instance on the failover server.
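
    For anyone trying to picture the failure Support describes, here is a toy model of it. It follows only the wording of the notes above ("a different mount Id" finding the controlfiles "already marked as mounted by the previous instance"); the `ControlFile` class, the `mounted_by` field and the exception are invented for the illustration and are not Oracle's real controlfile layout or internal check.

```python
from dataclasses import dataclass, replace
from typing import Optional


class AlreadyMountedError(Exception):
    """Toy stand-in for ORA-00600 [kccsbck_first]."""


@dataclass(frozen=True)
class ControlFile:
    # Hypothetical field: mount id of the instance that has the file mounted.
    mounted_by: Optional[int] = None


def mount(cf: ControlFile, mount_id: int) -> ControlFile:
    """Mount the controlfile, refusing if another mount id already holds it."""
    if cf.mounted_by is not None and cf.mounted_by != mount_id:
        raise AlreadyMountedError(
            f"controlfile marked as mounted by {cf.mounted_by}, not {mount_id}"
        )
    return replace(cf, mounted_by=mount_id)


def clean_shutdown(cf: ControlFile) -> ControlFile:
    """A clean shutdown flushes the 'not mounted' state back to disk."""
    return replace(cf, mounted_by=None)


if __name__ == "__main__":
    live1 = mount(ControlFile(), mount_id=1)

    # Replication copies the file while LIVE1 still has it mounted.
    copied_to_fail2 = live1
    try:
        mount(copied_to_fail2, mount_id=2)
    except AlreadyMountedError as err:
        print("mount on FAIL2 failed:", err)

    # If LIVE1 had shut down cleanly before the last copy, the mount succeeds.
    copied_after_shutdown = clean_shutdown(live1)
    print(mount(copied_after_shutdown, mount_id=2))
```

    In this model the copy taken while LIVE1 was still mounted is rejected on FAIL2, and the copy taken after a clean shutdown mounts without complaint.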

  9. #9
    Join Date
    Jan 2000
    Location
    Chester, England.
    Posts
    818
    Oh, and can I stress that Oracle do not support Double-Take!

    They will help us get back up and running after a corruption etc. caused by it, but that might be to tell us to revert to a back-up which kinda contradicts the entire failover-in-minutes theory - no?

    (When I heard that I wanted to pull the plug ... )

  10. #10
    Join Date
    Sep 2003
    Location
    over the hill and through the woods
    Posts
    995
    Well there's your answer right there Jmac!
    1/ As Oracle support doesn't give support on 'Sunbelt Software Double Take' Badda Bing
