DBAsupport.com Forums

Thread: Log Switching / Checkpointing Query

  1. #1
    Join Date
    Nov 2001
    Location
    Sheffield, England, UK
    Posts
    78

    Log Switching / Checkpointing Query

    I have been examining the wait times on non-idle events in my database recently, and have also cross-referenced these against CPU usage (service time) to evaluate the overall response time for each event.

    I was surprised to note that logging operations appeared to account for 20-30% of the total wait time, as follows:

    %Age Event

    56.16 CPU
    21.44 log file sync
    8.49 log file parallel write

    I also noticed that there were 50 timeouts on 'log file sync', and, although its wait time was lower, 402 timeouts on 'latch free'.
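
    For reference, the waits and timeouts come from something like the following - a sketch against the v$ views, with an illustrative (not exhaustive) idle-event exclusion list:

    Code:
    -- per-event waits and timeouts, with CPU appended from v$sysstat
    SELECT event, time_waited, total_timeouts
    FROM   v$system_event
    WHERE  event NOT IN ('rdbms ipc message', 'pmon timer', 'smon timer',
                         'SQL*Net message from client')
    UNION ALL
    SELECT name, value, NULL
    FROM   v$sysstat
    WHERE  name = 'CPU used by this session';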

    Looking at the latches, there were 5,000+ sleeps and the noted timeouts on the 'redo writing' latch.
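
    The latch figures themselves can be pulled with (a sketch):

    Code:
    -- gets, misses and sleeps on the redo-related latches
    SELECT name, gets, misses, sleeps
    FROM   v$latch
    WHERE  name IN ('redo writing', 'redo allocation', 'redo copy');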

    I found that checkpoints weren't being logged to the alert log, so I switched this on (log_checkpoints_to_alert), and thereby found that checkpoints were occurring in between log switches.

    So, I also set log_checkpoint_timeout to 0 and log_checkpoint_interval to a significantly high value, such that checkpoints now occur only at log switch time.
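
    For the record, the settings can be confirmed with (a sketch):

    Code:
    -- confirm the checkpoint-related parameters after the change
    SELECT name, value
    FROM   v$parameter
    WHERE  name IN ('log_checkpoints_to_alert',
                    'log_checkpoint_timeout',
                    'log_checkpoint_interval');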

    There was no apparent improvement in performance.

    I wondered if asynchronous I/O wasn't functioning too well, so I decided to run some tests on a non-production system (a copy of live). I moved from multiple db_writer_processes (8 of them) to the equivalent number of dbwr_io_slaves, to see if this would help.
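
    The writer configuration in effect can be checked the same way (a sketch):

    Code:
    -- which writer setup is this instance running with?
    SELECT name, value
    FROM   v$parameter
    WHERE  name IN ('db_writer_processes', 'dbwr_io_slaves', 'disk_asynch_io');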

    No apparent change. In fact, the wait times for the newly associated events seemed worse than in the previous setup:

    %Age Event

    28.28 CPU
    24.27 io done
    22.53 log file parallel write
    20.82 log file sync

    I wondered if the amount of redo being written out to the logs each time LGWR was activated was too great, and that not enough background writes were able to take place, such that we were waiting on synchronous writes all the time.

    Consequently, I lowered the log_buffer by varying amounts.

    Historically the buffer was set quite high, at 2,097,152 bytes, so we were probably waiting for LGWR to write around 683K of redo each time (LGWR flushes once the buffer is a third full: 2,097,152 / 3 ≈ 683K).

    Lowering the buffer, however, seemed to immediately start giving me waits for log space.
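
    Those waits show up in two v$sysstat statistics (a sketch):

    Code:
    -- a rising 'redo log space requests' count means sessions are waiting
    -- for LGWR to free space in the log buffer
    SELECT name, value
    FROM   v$sysstat
    WHERE  name IN ('redo log space requests', 'redo buffer allocation retries');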

    So - I decided to cheat at this point, and introduced the _log_io_size parameter into the equation. I left the log_buffer sized at its highest level, figuring it didn't really matter any more if I was manually setting the I/O size for LGWR.

    I set this to around 1024 log blocks (which on our system are 512 bytes, so a write threshold of about 512K), to see if there was any improvement.
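
    To put numbers on the thresholds involved (the usual rule of thumb being that LGWR flushes at the lesser of 1M and a third of the log buffer):

    Code:
    -- flush thresholds implied by the figures above
    SELECT 2097152/3/1024 AS default_third_kb,  -- ~683K with a 2M log buffer
           1024*512/1024  AS log_io_size_kb     -- 512K with _log_io_size = 1024
    FROM   dual;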

    At this point, there were no apparent improvements at all in response times. However, I do notice that the 'latch free' waits/timeouts have almost disappeared.

    I am currently running a test of the various parameters, but with dbwr_io_slaves switched off again, in favour of db_writer_processes instead.

    One final item.

    I'm noticing in the alert log that a checkpoint begins with each log switch... but we don't get completion of that checkpoint until some 15 minutes later... and 1 minute (or less) after that, another checkpoint begins.

    Our redo logs are currently sized at 50M. On the whole, log switches are averaging one every 45 minutes, though sometimes we get switches after 2.5 minutes.
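
    For anyone wanting to measure their own switch rate, v$log_history gives it by the hour (a sketch):

    Code:
    -- log switches per hour
    SELECT TRUNC(first_time, 'HH24') AS hr, COUNT(*) AS switches
    FROM   v$log_history
    GROUP  BY TRUNC(first_time, 'HH24')
    ORDER  BY hr;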

    I'm thinking I might upsize the logs - but this is based almost solely on these checkpoints and the way they are appearing in the alert log.

    It seems stupid that a checkpoint starts and then, almost as soon as it finishes, another one starts. I thought we might at least have some time in between for background writes.

    Anyone got any comments on all this?

    Am I losing my marbles somewhere, or are there some fatal flaws in my approach to all this?

    :-)

    - Tony.

    ENC: [I enclose a little snippet from the alert log]

  2. #2
    Join Date
    Sep 2000
    Location
    Chennai, India
    Posts
    865

    Re: Log Switching / Checkpointing Query

    Originally posted by AJW_ID01
    Our redo logs are currently sized at 50M. On the whole, log switches are averaging one every 45 minutes, though sometimes we get switches after 2.5 minutes.
    Double the size of your redo logs, from 50M to 100M. A log switch around every 45 minutes is OK, but not every 2.5 minutes.
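
    Something along these lines does it online (a sketch - the file names and group numbers are illustrative, and each old group must be INACTIVE before you drop it):

    Code:
    -- add new 100M groups alongside the existing 50M ones
    ALTER DATABASE ADD LOGFILE GROUP 5
      ('/export/data/redo3/redo05a.log',
       '/export/data/redo4/redo05b.log') SIZE 100M;
    -- switch until an old group goes INACTIVE, then retire it
    ALTER SYSTEM SWITCH LOGFILE;
    ALTER DATABASE DROP LOGFILE GROUP 1;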

    HTH.

  3. #3
    Join Date
    Nov 2001
    Location
    Sheffield, England, UK
    Posts
    78
    Any suggestions, anyone, on the increased waits when using I/O slaves instead of more DBWR processes?

    I specifically altered DISK_ASYNCH_IO to false, to see if the use of io_slaves improved. [NOPE!]

    Have now reverted to db_writer_processes instead, as this seems to be best.

    Can't check what difference resizing the online logs will make, as our test redo disks seem full right now!

    I AM wondering if heavy contention for these disks from the umpteen test environments on the box is severely skewing wait times for this one database I'm viewing.

    Might see if I can squeeze some larger redo files somewhere else, and test that.

    Many thanks for your response so far...

    Still reviewing! :-)

  4. #4
    Join Date
    Nov 2000
    Location
    greenwich.ct.us
    Posts
    9,092

    Re: Re: Log Switching / Checkpointing Query

    Originally posted by ggnanaraj
    Double the size of your redo logs, from 50M to 100M. A log switch around every 45 minutes is OK, but not every 2.5 minutes.
    Why? If your LGWR can keep up, why not switch every 2 minutes?
    Jeff Hunter

  5. #5
    Join Date
    Nov 2000
    Location
    greenwich.ct.us
    Posts
    9,092
    Are /export/data/redo3/ and /export/data/redo4/ different physical disks or filesystems?

    How busy are these disks?

    Typically when you see events such as "log file parallel write" there is contention on the device where one or both of the logs are kept.
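
    You can check where each log member lives from inside the database too (a sketch, using the pre-9i join style):

    Code:
    -- list each online log group, its size, and where its members live
    SELECT l.group#, l.bytes/1024/1024 AS mb, f.member
    FROM   v$log l, v$logfile f
    WHERE  f.group# = l.group#
    ORDER  BY l.group#, f.member;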
    Jeff Hunter

  6. #6
    Join Date
    Nov 2001
    Location
    Sheffield, England, UK
    Posts
    78
    Hmmm. Well, whilst I might know enough to get by in Oracle, my Unix skills are not SO hot.

    Can anyone help with some OS commands to check out disk usage?

    I've scoured around for a document on the disk layout.

    Redo 3 & 4 are on physically different disks, which are not mirrored. We rely on database multiplexing for the log files.

    It appears those same disks however have space assigned to other volumes - where there may well be some other contention occurring.

    If I can track this down at the OS level, I may be able to challenge someone about it.

    Any pointers for where my next little investigation ought to take me, command-wise?

    :-)

    - Tony.

  7. #7
    Join Date
    Nov 2000
    Location
    greenwich.ct.us
    Posts
    9,092
    Originally posted by AJW_ID01

    Can anyone help with some OS commands to check out disk usage?
    Code:
    # Solaris: extended disk stats (-x), disks only (-d), in MB (-M), every 10s
    iostat -xdM 10
    Look at %b; this is the percent busy for each device.


    Code:
    # CPU utilisation every 5 seconds, 10 samples
    sar 5 10
    Look at %wio. This will tell you how much of your CPU time is spent waiting on I/O.

    Code:
    # report the filesystem (and hence device) holding the current directory
    cd /export/data/redo3/
    df -kl .
    This will tell you which physical disk /export/data/redo3 is on.
    Jeff Hunter

  8. #8
    Join Date
    Nov 2000
    Location
    greenwich.ct.us
    Posts
    9,092
    Originally posted by AJW_ID01
    It appears those same disks however have space assigned to other volumes - where there may well be some other contention occurring.
    Sysadmins... Can't live with them, can't live without them.
    Jeff Hunter

  9. #9
    Join Date
    Nov 2002
    Location
    Geneva Switzerland
    Posts
    3,142
    I'm a bit mystified - is there a real performance problem? Percentage waits will always add up to 100%. And 56% waiting on CPU... is the machine really CPU bound?
    "The power of instruction is seldom of much efficacy except in those happy dispositions where it is almost superfluous" - Gibbon, quoted by R.P.Feynman

  10. #10
    Join Date
    Nov 2001
    Location
    Sheffield, England, UK
    Posts
    78
    Yes, the CPU has to do some real crunching on this box, unfortunately.

    Daily, this can drop down to 0% idle.

    It's a Sun E6000, with 8 CPUs online, 4G memory, 10G swap.

    On live, it's about the same, only we have more memory available on there (double).

    Of course, we'll always expect some waits - I was just concerned about where they were occurring, in respect of the redo, and wondered if we could at least look at lessening this.

    The database in the live environment is a high-profile warehouse, and when our batch gets delayed we suffer no end of flak from our business users. Failure to meet SLAs generally costs us more than my daily wage... so I prefer to keep everyone happy if I can.

    Someone asked me to take a look at performance, after some proverbial s**t hit the fan last week when I was off.

    I just thought I might be able to tweak this redo thing.

    :-)

    - Tony.
