-
Currently we have 2 x 140GB drives reserved for REDO, directly attached.
I'm seriously considering moving to raw devices for the redo logs. We have 6 independent drives (each 140GB) on a fibre backplane - direct attached, not SAN. I don't think the array controller is a bottleneck - it would not even break into a sweat with 6 drives.
Q. How would I configure them? My thoughts are:
Three sets of mirrored drives. Each mirror contains a log member.
i.e.
Drive 1 - REDO1A
Drive 2 - REDO1B
Drive 3 - REDO2A
Drive 4 - REDO2B
Drive 5 - REDO3A
Drive 6 - REDO3B
and then drives 1 & 2 would also contain REDO4A/B, drives 3 & 4 REDO5A/B, and drives 5 & 6 REDO6A/B, giving 6 groups of two members each - each member 256MB in size to keep the 15-minute switch rate.
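For what it's worth, that layout would come out something like this in SQL*Plus (a sketch only - the raw device paths here are invented placeholders; substitute whatever your engineers carve out, and remember raw partitions must be at least slightly larger than the 256MB members):
Code:
-- Groups 1-3: one member on each drive of a pair (paths are hypothetical)
ALTER DATABASE ADD LOGFILE GROUP 1
  ('/dev/rdsk/drive1_redo1a', '/dev/rdsk/drive2_redo1b') SIZE 256M;
ALTER DATABASE ADD LOGFILE GROUP 2
  ('/dev/rdsk/drive3_redo2a', '/dev/rdsk/drive4_redo2b') SIZE 256M;
ALTER DATABASE ADD LOGFILE GROUP 3
  ('/dev/rdsk/drive5_redo3a', '/dev/rdsk/drive6_redo3b') SIZE 256M;
-- Groups 4-6 cycle back onto the same drive pairs
ALTER DATABASE ADD LOGFILE GROUP 4
  ('/dev/rdsk/drive1_redo4a', '/dev/rdsk/drive2_redo4b') SIZE 256M;
You would drop the old groups one at a time once each goes INACTIVE.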
Archiving goes to a 470GB 9*72GB RAID5 array (and soon to a Standby server over the WAN)
Afraid my SUN hardware knowledge is a bit basic - I have experienced engineers to do the physical stuff - but they don't know Oracle very well!
So - do we think this REDO configuration will allow us to do DML during the working day without killing the OLTP users?
__________________
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Guy Wicks
Database Manager
The British Library
Last edited by hrishy; 06-06-2005 at 06:17 AM.
-
Hi
I think it's okay to have 6 groups with two members each, and your redo log layout also seems fine to me.
Now, whether this will solve your problem can only be answered by actually doing the task. :-)
You could also clone your prod DB to a test instance and see whether you can really reproduce the problem there.
regards
Hrishy
-
Originally posted by blguy
Q. How would I configure them? My thoughts are:
Three sets of mirrored drives. Each mirror contains a log member.
i.e.
Drive 1 - REDO1A
Drive 2 - REDO1B
Drive 3 - REDO2A
Drive 4 - REDO2B
Drive 5 - REDO3A
Drive 6 - REDO3B
and then drives 1 & 2 would also contain REDO4A/B, drives 3 & 4 REDO5A/B, and drives 5 & 6 REDO6A/B, giving 6 groups of two members each - each member 256MB in size to keep the 15-minute switch rate.
I don't see the point of spreading across three sets of drives. LGWR will write to one pair (on separate controllers, right?). Then ARCH will read the logfile and send it to log_archive_dest_n and/or the standby while LGWR is working on another pair. Since the ARCH process is so quick (most likely), the time spent reading the log that was just switched is minimal. Eight disks would be ideal: you could set up two 0+1 filesystems and write one member to each filesystem (as long as you're using hardware RAID).
For what it's worth, I think you're masking your real problem with hardware.
Jeff Hunter
-
Jeff,
I think you're right - the fact the system runs OK during the day is just brute force - a massive overspec of the base system. Generally the system bumbles along with 4 to 6 CPUs active (per top) and the rest idle/sleeping. It may not be nice, but it works.
(This is a basically a government procurement - of course we got screwed eh?)
Will arrange for the redo drives to be reconfigured at the next scheduled down time (need to change control etc etc etc) and I'll see if it has any improvement on the flexibility of the system.
In answer to my original question - would you expect a CTAS to cripple this system? - I think the answer is a definite no, and I'm glad everyone else agrees. I thought I might be going mad!
Anyway - got to get onto setting up a SQLServer cluster with replication. Oh Joy!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Guy Wicks
Database Manager
The British Library
-
Have you done the stress test and the new statspack report? I wouldn't suggest you change anything without finding the root cause!
Your CTAS can hang your system because of redo log issues. I say that because you said that with NOLOGGING the system runs fine, without your users complaining.
-
I feel there is some problem even when operations like CTAS are NOT running. Response time is made up of roughly 15% service time and roughly 85% wait time. Almost every commit is waiting on "log file sync".
I strongly suggest you check everything about the disks hosting your online redo logs. Do you ever see at least one of the redo groups with status INACTIVE? How big are the online redo logs? How many archive logs are generated in a day, and during peak times?
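A couple of quick queries should answer those (a sketch - exact columns vary a little between Oracle versions):
Code:
-- status, member count and size of each online redo group
SELECT group#, members, bytes/1024/1024 AS mb, status
FROM   v$log;

-- log switches per hour over the last day
-- (v$log_history has one row per completed switch)
SELECT TO_CHAR(first_time, 'DD-MON HH24') AS hour, COUNT(*) AS switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 1
GROUP  BY TO_CHAR(first_time, 'DD-MON HH24')
ORDER  BY 1;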
svk
-
svk
I've never seen INACTIVE redo groups.
We have 5 members per group, each member 256MB. Average log switch every 10 to 15 minutes during the day.
Here are the results of the CTAS test just before the users arrive (the reading rooms open at 9:30 BST) - this would have upset a few internet web users though!
This is the SQLPLUS script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Guy Wicks
Database Manager
The British Library
-
Stats pack report for processing during CTAS logging exercise
sp_8131_8133.txt
And a couple of Spotlight captures:
Notice the RED on log writer - the average log write time can go over 1,000 ms (1 sec) - so is the drive being swamped? At over 2,500 blocks per second (8K blocks) that is roughly 20MB/sec transferred - a heavy sustained synchronous write load for a single drive.
See the orange block grow as the number of active sessions climbs. This is how I know I'm affecting the online users.
Why would the CTAS saturate the REDO logs and prevent the OLTP sessions from processing? Can Oracle not prioritise the small OLTP sessions?
I feel the CTAS exacerbates an existing problem - the system is 'out of balance': very fast CPU/datafile but slow REDO. However, this is all proper SUN kit!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Guy Wicks
Database Manager
The British Library
-
Man, didn't you see what the users were waiting on in v$session_wait while you ran the CTAS?!
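Something like this, run while the CTAS is going, would show exactly what the OLTP sessions are stuck on (a sketch - v$session_wait is there in 8i/9i, though the SECONDS_IN_WAIT column may be missing on very old releases):
Code:
-- what every session is currently waiting on, worst first
SELECT sid, event, state, seconds_in_wait
FROM   v$session_wait
WHERE  event NOT LIKE 'SQL*Net%'
ORDER  BY seconds_in_wait DESC;
If the OLTP sessions are all piled up on "log file sync", that confirms the redo subsystem as the choke point.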
-
Also, the first snap was 10 minutes; this second one is 4 minutes!
However, if you look at the log file sync waits:
Code:
First SNAP without CTAS:
log file sync 56,757 225,427 63.60
Second SNAP with CTAS:
log file sync 7,697 686,518 93.81
The second SNAP has only around 13% of the first SNAP's waits (7,697 vs 56,757), yet its total wait time is roughly three times longer (686,518 cs vs 225,427 cs - an average of 892 ms per wait against 40 ms). That clearly shows you have serious I/O problems with your redo log subsystem.
Code:
SECOND SNAP
Event Waits Timeouts Time (cs) (ms) /txn
---------------------------- ------------ ---------- ----------- ------ ------
log file sync 7,697 3,637 686,518 892 1.9
db file sequential read 87,472 0 28,547 3 21.4
log file parallel write 349 0 9,263 265 0.1
db file parallel write 256 0 2,547 99 0.1
log file switch completion 26 3 1,179 453 0.0
log buffer space 108 0 1,043 97 0.0
buffer busy waits 264 2 898 34 0.1
control file parallel write 82 0 677 83 0.0
direct path write 161 0 372 23 0.0
FIRST SNAP
Event Waits Timeouts Time (cs) (ms) /txn
---------------------------- ------------ ---------- ----------- ------ ------
log file sync 56,757 0 225,427 40 1.0
db file sequential read 1,729,916 0 93,853 1 30.6
log file parallel write 39,154 0 29,790 8 0.7
db file parallel write 576 0 2,203 38 0.0
SQL*Net more data to client 540,735 0 2,161 0 9.6
control file parallel write 206 0 508 25 0.0
log file sequential read 4,194 0 163 0 0.1
log file switch completion 5 0 75 150 0.0