Slow OLTP on 14CPU SunFire
DBAsupport.com Forums - Powered by vBulletin
Page 1 of 5 123 ... LastLast
Results 1 to 10 of 44

Thread: Slow OLTP on 14CPU SunFire

  1. #1
    Join Date
    Jun 2005
    Location
    Yorkshire, UK
    Posts
    12

    Slow OLTP on 14CPU SunFire

    We have a large (3TB) Oracle 8.1.7 OLTP database running on a 14CPU SunFire server (48GB RAM). This drives a website with around 100 users requests per second (transactional commits per second). We use MTS to manage web server connection requests.

    Generally the system runs fine - easy sub second responses. However, this can be very easily upset if we need to do maintenance tasks (data loads and index rebuilds) and we find that this brings the system to a virtual halt.

    We are planning a large upgrade and data migration on the system - which will have to include work during working hours (it's a 6 month project). As the DBA, I'm concerned that the system this powerful cannot support OLTP users and our data management tasks.

    Is this typical? Do other large OLTP Oracle system have the same "feature" where a 5 minute "CREATE TABLE AS SELECT" command in SQLPLUS turns millisecond web responses into 5+ second responses.

    Example - During normal working day, a look at V$SESSION will show 100 sessions, with only 5 or so "ACTIVE" at that very moment. (Each web request is sub second so you don't see them!) However, if I run a "CREATE TABLE newtable AS SELECT * FROM table_with_a_million_rows" then V$SESSION will show up all the active users (60+) and their active session time goes into many seconds. User response becomes very poor (the phone starts ringing off the hook!).

    If I run the same in "NOLOGGING" mode, then there is no noticable problem. Is this a LGWR problem?

    I know this post does not have many "facts" - I will be willing to supply as requred.

    Thanks in advance - Guy

  2. #2
    Join Date
    Jun 2000
    Location
    Madrid, Spain
    Posts
    7,448
    what are your users waiting?

    and yes when you do CTAS you should do it in nologging

    you might consider that your archiver process are not fast enough or the archive destination is not fast enough or your log files are too small or have you checked your alert?

    I mean being a Senior DBA (I wouldnt let a junior manage a 3TB OLTP system) you cannot just based on v$session to perform your tuning tasks correct?

  3. #3
    Join Date
    Jun 2005
    Location
    Yorkshire, UK
    Posts
    12
    Only been an Oracle DBA for 15 years - so yes, I'm junior.

    But thanks for the reply!

    I could have posted hundreds of lines of statspack values, and will if applicable. But before I get deep in diagnositic mode (and I have been for the last few weeks) I want to make sure I actually have a problem!

    The CTAS was an example. It's a way I can reproduce the "problem" in an repeatable manner. It is indicative of the problem when we have to do a lot of updates to the system. It would be easy to use NOLOGGING all the time - but it would rather negate the point of having Oracle! This would not be an option in production.


    The V$SESSION example shows that usually, it's empty. But when I give the system a little work to do - then suddenly these 10ms sessions take 5000ms to complete and we can measurably see the system is slow for them, which is why we they come up on V$SESSION.

    Alert shows nothing (just log switches).


    The question remains - if you have a hight OLTP system with 100's of commits a second - and you run a "high redo" operation (create index / copy table, update many rows, insert many rows, CTAS) - would you EXPECT at 14CPU machine to stop processing the very light OLTP requests?


    I honestly think no. I think a correctly configured 4 CPU intel box would easily cope with the same task. The point is "correctly configured" - which this system appears not to be.


    My "weakness" is that I don't know much about the specific application, nor did I set up the server. To be honest, it's not well written (external company) and it uses no stored procedures, RI, views etc. The company who wrote the code have never implemented it on such a scale before. Unfortunatly we have no way of changing the code or the SQL (which is why we are migrating to the newer version which is a little more Oracle friendly) and 9r2.


    Specifics:

    We have a 2.5MB redo buffer - we are not getting redo buffer waits but we do get high 'log sync waits'. I feel this is a cumulative problem - when the session commits, it waits for LGWR to flush the redo buffer. As we have 100 commits a second, it means we have 100 times the "wait time" recorded - so this may not be indicative of the problem per say.

    Yes - I think REDO speed is a part of this - we have dedicated REDO drives (currently a pair of JBODs multiplexed) with 6 members in 2 groups each 256MB. We switch every 15 mins or so during the day.

    From what I see, ARCHIVE has little impact. It takes 10 to 20 seconds to copy the redo to archive - there is no difference between that 20 second copy remaining 14.5 minutes when it's just filling up redo logs.


    We have experimented on a test system (without the same load) and get acceptable CTAS performance on it with single drives. It's a bit more difficult on the prod system to test.




    Here is the top of the stats pack for a 10 minute snapshot:

    Code:
    STATSPACK report for
    
    DB Name         DB Id    Instance     Inst Num Release     OPS Host
    ------------ ----------- ------------ -------- ----------- --- ------------
    XXXXXX         532351126 XXXXXX 1 8.1.7.4.0   NO  XXX-DATASRV
    
                    Snap Id     Snap Time      Sessions
                    ------- ------------------ --------
     Begin Snap:       7575 03-Jun-05 10:24:32      115
       End Snap:       7576 03-Jun-05 10:34:48      115
        Elapsed:                  10.27 (mins)
    
    Cache Sizes
    ~~~~~~~~~~~
               db_block_buffers:     200000          log_buffer:    2621440
                  db_block_size:       8192    shared_pool_size:  200000000
    
    Load Profile
    ~~~~~~~~~~~~                            Per Second       Per Transaction
                                       ---------------       ---------------
                      Redo size:            236,471.74              2,580.18
                  Logical reads:             16,930.78                184.73
                  Block changes:                710.96                  7.76
                 Physical reads:              2,810.45                 30.67
                Physical writes:                 69.84                  0.76
                     User calls:              5,009.34                 54.66
                         Parses:                243.37                  2.66
                    Hard parses:                 11.26                  0.12
                          Sorts:                  9.51                  0.10
                         Logons:                  0.13                  0.00
                       Executes:              2,677.66                 29.22
                   Transactions:                 91.65
    
      % Blocks changed per Read:    4.20    Recursive Call %:    3.58
     Rollback per transaction %:    0.00       Rows per Sort:   37.45
    
    Instance Efficiency Percentages (Target 100%)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                Buffer Nowait %:  100.00       Redo NoWait %:  100.00
                Buffer  Hit   %:   83.40    In-memory Sort %:  100.00
                Library Hit   %:   98.98        Soft Parse %:   95.37
             Execute to Parse %:   90.91         Latch Hit %:   99.93
    Parse CPU to Parse Elapsd %:   96.01     % Non-Parse CPU:   98.03
    
     Shared Pool Statistics        Begin   End
                                   ------  ------
                 Memory Usage %:   43.71   43.73
        % SQL with executions>1:   86.17   86.17
      % Memory for SQL w/exec>1:   82.55   82.56
    
    Top 15 Wait Events
    ~~~~~~~~~~~~~~~~~                                             Wait     % Total
    Event                                               Waits  Time (cs)   Wt Time
    -------------------------------------------- ------------ ------------ -------
    log file sync                                      56,757      225,427   63.60
    db file sequential read                         1,729,916       93,853   26.48
    log file parallel write                            39,154       29,790    8.40
    db file parallel write                                576        2,203     .62
    SQL*Net more data to client                       540,735        2,161     .61
    control file parallel write                           206          508     .14
    log file sequential read                            4,194          163     .05
    log file switch completion                              5           75     .02
    latch free                                            875           70     .02
    SQL*Net break/reset to client                      14,042           69     .02
    file open                                           4,455           65     .02
    db file scattered read                                239           35     .01
    enqueue                                                26           21     .01
    buffer busy waits                                     103           13     .00
    refresh controlfile command                           110            4     .00

  4. #4
    Join Date
    Jun 2000
    Location
    Madrid, Spain
    Posts
    7,448
    When you do CTAS do you check what are your users waiting for?

    Log file sync is cause by commits, if your application must do vast amount of commits then the way to improve would be putting your redo logs on fast disks. Since you are suffering

    log file sync
    log file parallel write
    log file switch completion
    log file sequential read

    You might as well check your redo log speed (all members and all groups), you should check you arch process as well, the arch destination must be fast.

    I suspect the disks where the redo logs are located are not fast enough.

    Also I see you suffer 5 log file switch completion in 10 minutes. When the logfile are switching and the switch is not complete all the sessions are hanged therefore you see the 10ms operations take 5000ms.

    Dont you see messages such as, cannot allocate new log in your alert?

    And I didnt mean you are Junior, I just thought that one would check first what are the users waiting for when the system hangs! And that info is certainly not in v$session!

    Cheers
    Last edited by pando; 06-03-2005 at 08:15 AM.

  5. #5
    Join Date
    Jun 2000
    Location
    Madrid, Spain
    Posts
    7,448
    And just curious, doesnt the App/Web Server have a connection pool?

    In the past I have seen MTS + Connection Pooling combinations dont perform very well

  6. #6
    Join Date
    May 2005
    Posts
    129
    Most of your wait time is log file sync 225,427

    Perhaps you should convert the JBOD to striped.

    All those CPU's wont help when your problem is I/O bottleneck..here is a look at CPU bottleneck and notice how quick all the log file synchs are:

    Code:
    Top 5 Timed Events
                                                                 % Total
    Event                                     Waits    Time (s) Ela Time
    ---------------------------------- ------------ ----------- --------
    CPU time                                  4,851       4,042    55.76
    db file sequential read                   1,968       1,997    27.55
    log file sync                           299,097         369     5.08
    db file scattered read                   53,031         330     4.55
    log file parallel write                 302,680         190     2.62
    Last edited by roadwarriorDBA; 06-03-2005 at 08:21 AM.

  7. #7
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    And what size of redo log buffer are you using?
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

  8. #8
    Join Date
    Jan 2001
    Posts
    2,828
    Hi Slim

    He is using 2.5Mb redo buffer.

    regards
    Hrishy

  9. #9
    Join Date
    Jun 2005
    Location
    Yorkshire, UK
    Posts
    12
    pando,
    Don't know about connection pooling - will ask. What would be a better alternative?

    How can I tell how many "sessions" or users passed thru the system in the 10 mins of the snap shot when in MTS mode?



    roadwarriorDBA,

    I *think* the 'log sync wait' can be a red herring - if you follow this logic...

    I have 91 transactions per second (from the statspack report) - and in the 10 minutes I waited 225,427 cs.

    91 commits per second * 60 seconds * 10 minutes = 54600 commits in 10 minutes.

    225,427 cs waited / 54,600 = 4.13 cs per commit. Each commit waits, on average 4 hundreths of a second.

    I waited 56,757 times - each time 0.04 of a second. Sounds not too bad like that?

    Or tell me I'm wrong!

    From your stats, you can see that your log sync waits are caused by slow log file parallel writes.


    I've basically got the following options (if redo speed is at fault)

    1/ I have 6 JBOD drives on the system - I can create 3 pairs of redo drives - each "pair" would be hold one redo set. After a log switch, LGWR would move to the next drive and archiver would use the previous drive. No contention there. But is it fast enough?

    2/ I stripe 3 drives and mirror them. Will this be faster that dedicated drives?

    3/ I reclaim one of the R5 arrays as redo. (not nice)

    4/ I spend many $$$ on solid state REDO drives. BUT NOT before I PROVE redo speed is at the root cause.


    BTW - Our ARCH space is a 9 x 72GB RAID5 fibre channel array. We will soon be using a remote STANDBY database - so archives will be shipped over the WAN too.



    One way to test this would be to do a statspack report while the system is running a CTAS with a snap before and after. I'll do this later today when no-one is looking


    Thanks for your feedback so far guy's (and gal's) It's giving me plenty of idea's.

  10. #10
    Join Date
    Nov 2000
    Location
    greenwich.ct.us
    Posts
    9,095
    Now that we have the whole system view, I would kind of look at it from two different times. The most critical part to your system seems to be the web sessions, that's where I would start.

    First, I would want to get a trace of a session during "Normal" time to see what my waits are. You may find that you are still waiting on "log file sync", just not as much. Since you have multiple connections, you might have to trace for an hour or so.

    I would then do the same trace while I was doing a CTAS to see where my application thinks it is spending most of it's time.

    I would then compare the two traces and see where the major differences are.

    You may also want to monitor your disk I/O using iostat. If your log disks are really busy striping and raw disks may be in order.

    When I see the "log file sync" event as the top wait event, I know it means my developers are over committing. However, in your case, you really have two different types of sessions, OLTP and long running transactions. When a "log file sync" happens, LGWR is working really hard. The condition may be exacerbated by your big redo log buffer.
    Jeff Hunter
    marist89@yahoo.com
    http://marist89.blogspot.com/
    Get Firefox!
    "I pledge to stop eating sharks fin soup and will not do so under any circumstances."

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width