I have been examining the wait times on non-idle events in my database recently, and have also cross-referenced these against CPU usage (service time) to evaluate the overall response time for each event.
I was surprised to note that logging operations appeared to account for 20-30% of the total wait time, as follows:
I also noticed that there were 50 timeouts on 'log file sync', and although service time was lower... there were 402 timeouts on 'latch free'.
Looking at the latches, there were 5,000+ sleeps and the noted timeouts on the 'redo writing' latch.
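For reference, these figures come out of the standard v$ views - queries along these lines (output trimmed; on 8i the times are in centiseconds):

-- wait events of interest
SELECT event, total_waits, total_timeouts, time_waited, average_wait
  FROM v$system_event
 WHERE event IN ('log file sync', 'log file parallel write', 'latch free')
 ORDER BY time_waited DESC;

-- sleeps on the redo writing latch
SELECT name, gets, misses, sleeps
  FROM v$latch
 WHERE name = 'redo writing';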
I found that checkpoints weren't being recorded in the alert log, so I switched this on (log_checkpoints_to_alert), and thereby discovered that checkpoints were occurring in between log switches.
So I also set log_checkpoint_timeout to 0 and log_checkpoint_interval to a significantly high value, such that checkpoints now only occur at log switch time.
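In init.ora terms, roughly this (the interval figure below is illustrative, not our exact value):

# log checkpoint start/complete messages to the alert log
log_checkpoints_to_alert = true
# disable time-based checkpoints
log_checkpoint_timeout = 0
# interval is in OS blocks; anything beyond the redo log size
# means only a log switch will trigger a checkpoint
log_checkpoint_interval = 999999999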
There was no apparent improvement in performance.
I wondered if asynchronous I/O wasn't functioning too well, so decided to run some tests on a non-production system (a copy of live). I moved from multiple db_writer_processes (8 of them) to the equivalent in dbwr_io_slaves, to see if this would help.
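The swap, in init.ora terms, was roughly this - note that a non-zero dbwr_io_slaves forces a single DBWR, which is why I call 8 slaves "the equivalent" (both parameters are static, so a restart was needed):

# before: multiple database writer processes
db_writer_processes = 8

# after: one DBWR handing writes off to I/O slaves
dbwr_io_slaves = 8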
No apparent change. In fact, the wait times for the new associated events seemed to be worse than under the previous setup:
I wondered if the amount of redo being written out to the logs each time LGWR was activated was too great, and that not enough background writes were able to take place, such that we were waiting on sync writes all the time.
Consequently, I lowered the log_buffer by varying amounts.
Historically the buffer was set quite high, at 2097152 bytes; since LGWR kicks in once the buffer is a third full, we were probably waiting for around 683K of redo to be written each time.
Lowering the buffer, however, seemed to immediately start giving me waits for log space.
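These waits show up as the following v$sysstat statistics, which is where I was seeing them:

-- sessions stalling for log buffer space increment these
SELECT name, value
  FROM v$sysstat
 WHERE name IN ('redo log space requests', 'redo log space wait time');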
So - I decided to cheat at this point, and introduced the _log_io_size parameter into the equation. I left the log_buffer sized at its highest level, figuring it didn't really matter any more if I was manually setting the I/O size for LGWR.
I set this to around 1024 log blocks (which on our system are 512 bytes each), i.e. roughly a 512K write threshold, to see if there was any improvement.
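That is, in init.ora (an unsupported underscore parameter, so use with care; restart required):

# LGWR write threshold in log blocks: 1024 x 512 bytes = 512K here
_log_io_size = 1024

For anyone wanting to confirm their own log block size, the usual trick (connected as SYS) is:

SELECT MAX(lebsz) FROM x$kccle;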
At this point, there were no apparent improvements at all in response times. However, I do notice that the 'latch free' waits/timeouts have almost disappeared.
I am currently running a test of the various parameters, but with dbwr_io_slaves switched off again, in favour of db_writer_processes instead.
One final item.
I'm noticing in the alert log that a checkpoint begins with each log switch... but we don't get completion of that checkpoint until some 15 minutes later... and 1 minute (or less) after that... another checkpoint begins.
Our redo logs are currently sized at 50M. On the whole log switches are averaging a switch every 45 minutes, though sometimes we get switches after 2.5 minutes.
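The switch frequency is easy enough to eyeball from v$log_history, with something like:

-- one row per log switch; gaps between first_time values show the switch rate
SELECT sequence#, TO_CHAR(first_time, 'DD-MON-YYYY HH24:MI:SS') switch_began
  FROM v$log_history
 ORDER BY first_time DESC;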
I'm thinking I might upsize the logs - but this is based almost solely on these checkpoints and the way they are appearing in the alert log.
It seems stupid that one checkpoint starts and then, almost as soon as it finishes, another one starts. I thought we might at least have some time in between for background writes.
Anyone got any comments on all this?
Am I losing my marbles somewhere, and got some fatal flaws on my approach to all this?
:-)
- Tony.
ENC: [I enclose a little snippet from the alert log]
Originally posted by AJW_ID01 Our redo logs are currently sized at 50M. On the whole log switches are averaging a switch every 45 minutes, though sometimes we get switches after 2.5 minutes.
Double the size of your redo logs from 50M to 100M. A log switch around every 45 minutes is OK, but one every 2.5 minutes is not.
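If you do resize, the usual approach is to add new groups at the bigger size and drop the old ones once they go INACTIVE - the group numbers and file paths below are made up for illustration:

ALTER DATABASE ADD LOGFILE GROUP 4 ('/u01/oradata/redo04a.log') SIZE 100M;
ALTER DATABASE ADD LOGFILE GROUP 5 ('/u02/oradata/redo05a.log') SIZE 100M;
ALTER DATABASE ADD LOGFILE GROUP 6 ('/u03/oradata/redo06a.log') SIZE 100M;
-- switch until the old groups show INACTIVE in v$log, then:
ALTER SYSTEM SWITCH LOGFILE;
ALTER DATABASE DROP LOGFILE GROUP 1;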
Any suggestions, anyone, on the increased waits using I/O slaves instead of more DBWR processes?
I specifically set DISK_ASYNCH_IO to false, to see if the use of I/O slaves improved things. [NOPE!]
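That is, in init.ora (restart required):

# disable OS async I/O, so the I/O slaves simulate it instead
disk_asynch_io = false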
Have now reverted to db_writer_processes instead, as this seems to work best.
Can't check what resizing the online logs will do, as our test redo disks seem full right now!
I AM wondering if heavy contention for these disks from the umpteen test environments on the box is severely skewing wait times for this one database I'm viewing.
Might see if I can squeeze some larger redo files somewhere else, and test that.
Originally posted by AJW_ID01 It appears those same disks however have space assigned to other volumes - where there may well be some other contention occurring.
Sysadmins... Can't live with them, can't live without them.
I'm a bit mystified - is there a real performance problem? The percentage waits will always add up to 100%. And 56% waiting on CPU... is the machine really CPU bound?
"The power of instruction is seldom of much efficacy except in those happy dispositions where it is almost superfluous" - Gibbon, quoted by R.P.Feynman
Yes, the CPU has to do some real crunching on this box, unfortunately.
Daily, this can drop down to 0% idle.
It's a Sun E6000, with 8 CPUs online, 4G memory, 10G swap.
On live, it's about the same, only we have more memory available there (double).
Of course, we'll always be expecting some waits - I was just concerned about where they were, in respect of the redo, and wondered if we could at least look at lessening this.
The database in the live environment is a high-profile warehouse, and when our batch gets delayed we suffer no end of flak from our business users. Failure to meet SLAs generally costs us more than my daily wage... so I prefer to keep everyone happy if I can.
Someone asked me to take a look at performance, after some proverbial s**t hit the fan last week when I was off.
I just thought I might be able to tweak this redo thing.