Only been an Oracle DBA for 15 years - so yes, I'm junior.
But thanks for the reply!
I could have posted hundreds of lines of statspack values, and will if applicable. But before I get deep in diagnositic mode (and I have been for the last few weeks) I want to make sure I actually have a problem!
The CTAS was an example. It's a way I can reproduce the "problem" in an repeatable manner. It is indicative of the problem when we have to do a lot of updates to the system. It would be easy to use NOLOGGING all the time - but it would rather negate the point of having Oracle! This would not be an option in production.
The V$SESSION example shows that usually, it's empty. But when I give the system a little work to do - then suddenly these 10ms sessions take 5000ms to complete and we can measurably see the system is slow for them, which is why we they come up on V$SESSION.
Alert shows nothing (just log switches).
The question remains - if you have a hight OLTP system with 100's of commits a second - and you run a "high redo" operation (create index / copy table, update many rows, insert many rows, CTAS) - would you EXPECT at 14CPU machine to stop processing the very light OLTP requests?
I honestly think no. I think a correctly configured 4 CPU intel box would easily cope with the same task. The point is "correctly configured" - which this system appears not to be.
My "weakness" is that I don't know much about the specific application, nor did I set up the server. To be honest, it's not well written (external company) and it uses no stored procedures, RI, views etc. The company who wrote the code have never implemented it on such a scale before. Unfortunatly we have no way of changing the code or the SQL (which is why we are migrating to the newer version which is a little more Oracle friendly) and 9r2.
Specifics:
We have a 2.5MB redo buffer - we are not getting redo buffer waits but we do get high 'log sync waits'. I feel this is a cumulative problem - when the session commits, it waits for LGWR to flush the redo buffer. As we have 100 commits a second, it means we have 100 times the "wait time" recorded - so this may not be indicative of the problem per say.
Yes - I think REDO speed is a part of this - we have dedicated REDO drives (currently a pair of JBODs multiplexed) with 6 members in 2 groups each 256MB. We switch every 15 mins or so during the day.
From what I see, ARCHIVE has little impact. It takes 10 to 20 seconds to copy the redo to archive - there is no difference between that 20 second copy remaining 14.5 minutes when it's just filling up redo logs.
We have experimented on a test system (without the same load) and get acceptable CTAS performance on it with single drives. It's a bit more difficult on the prod system to test.
Here is the top of the stats pack for a 10 minute snapshot:
Code:
STATSPACK report for
DB Name DB Id Instance Inst Num Release OPS Host
------------ ----------- ------------ -------- ----------- --- ------------
XXXXXX 532351126 XXXXXX 1 8.1.7.4.0 NO XXX-DATASRV
Snap Id Snap Time Sessions
------- ------------------ --------
Begin Snap: 7575 03-Jun-05 10:24:32 115
End Snap: 7576 03-Jun-05 10:34:48 115
Elapsed: 10.27 (mins)
Cache Sizes
~~~~~~~~~~~
db_block_buffers: 200000 log_buffer: 2621440
db_block_size: 8192 shared_pool_size: 200000000
Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
--------------- ---------------
Redo size: 236,471.74 2,580.18
Logical reads: 16,930.78 184.73
Block changes: 710.96 7.76
Physical reads: 2,810.45 30.67
Physical writes: 69.84 0.76
User calls: 5,009.34 54.66
Parses: 243.37 2.66
Hard parses: 11.26 0.12
Sorts: 9.51 0.10
Logons: 0.13 0.00
Executes: 2,677.66 29.22
Transactions: 91.65
% Blocks changed per Read: 4.20 Recursive Call %: 3.58
Rollback per transaction %: 0.00 Rows per Sort: 37.45
Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer Nowait %: 100.00 Redo NoWait %: 100.00
Buffer Hit %: 83.40 In-memory Sort %: 100.00
Library Hit %: 98.98 Soft Parse %: 95.37
Execute to Parse %: 90.91 Latch Hit %: 99.93
Parse CPU to Parse Elapsd %: 96.01 % Non-Parse CPU: 98.03
Shared Pool Statistics Begin End
------ ------
Memory Usage %: 43.71 43.73
% SQL with executions>1: 86.17 86.17
% Memory for SQL w/exec>1: 82.55 82.56
Top 15 Wait Events
~~~~~~~~~~~~~~~~~ Wait % Total
Event Waits Time (cs) Wt Time
-------------------------------------------- ------------ ------------ -------
log file sync 56,757 225,427 63.60
db file sequential read 1,729,916 93,853 26.48
log file parallel write 39,154 29,790 8.40
db file parallel write 576 2,203 .62
SQL*Net more data to client 540,735 2,161 .61
control file parallel write 206 508 .14
log file sequential read 4,194 163 .05
log file switch completion 5 75 .02
latch free 875 70 .02
SQL*Net break/reset to client 14,042 69 .02
file open 4,455 65 .02
db file scattered read 239 35 .01
enqueue 26 21 .01
buffer busy waits 103 13 .00
refresh controlfile command 110 4 .00