Odd Database issue with sudden I/O wait

**roadwarriorDBA** · 02-04-2006, 02:27 PM

This DB is the backend to Oracles AS/Portal, its the MDR/Infrastructure.

The tuning has always been about CPU reduction, always been CPU bound. Twice Oracle recommended using less memory and I/O wait has never been the issue it usually is inside a Datawarehouse. We successfully load tested 2200 simulated concurrent users on the stack. We were all ready to head home when friday and launch sunday night when the application engineers were re-deploying their application and the Database went to 99% I/O Wait. Nothing could be identified as the cause. Eventually we restored our backups from the night before and to our astonishment the high I/O wait resumed immediately.

I got a Sev 1 Tar opened up and worked it through the nite with support in India. They never turned anything up with the DB. Hours later it dropped from 99% I/O Wait to much lower but still higher than at any time in the past, much too high for an idle DB connected to 2 AS servers.

I still think this is an I/O subsystem degradation due to a broken mirror or something, semi-failed controller. Hardware guys dismiss it immediately.

Any ideas ? Statspack, Enterprise manager have been utilized / uploaded / pondered but to no avail.

I say hardware.

**pando** · 02-04-2006, 02:34 PM

Well what process was haging the I/O system? Did you check what process at DB level is doing I/O activity?

**roadwarriorDBA** · 02-04-2006, 03:17 PM

DBW0, we took an Oracle debug trace and sent it up to metalink. They said after an hour of web conferencing that all was normal. We got transferrred to the portal group at that point where we will likely go nowhere on this issue.

Also its interesting its writing 903 blocks per second on a database that is not active...earlier we had

Now its more normal..

Sat Feb 4 12:13:39 2006
Private_strands 7 at log switch
Thread 1 advanced to log sequence 340
Current log# 3 seq# 340 mem# 0: /oldoradata/ONLINE_LOGS/redo03.log
Sat Feb 4 12:17:23 2006

Earlier it was switching rapidly second apart, I would recreate larger online redo logs but this thing has been fine in this configuration all week under load test.

ARCH: Connecting to console port...
Committing creation of archivelog '/opt/backup/ARCHIVE_LOGS/1_40_581465171.dbf'
Fri Feb 3 22:32:18 2006
ARC1: Completed archiving thread 1 sequence 40 (19129030-19135397) (asdb)
ARCH: Connecting to console port...
Fri Feb 3 22:32:21 2006
Thread 1 cannot allocate new log, sequence 42
Checkpoint not complete

I have disabled archive logging to reduce the contention, I am turning it back on now to see how it responds.

Also system tablespace was seen growing around 2:00 am last night and had increased by 10MB, seems stable now and I cannot find any users set to use its temp tablespace.

USERNAME TEMPORARY_TABLESPACE
------------------------------ ------------------------------
PERFSTAT TEMP
OWF_MGR TEMP
ORASSO_PA TEMP
ORASSO_PS TEMP
ORASSO_DS TEMP
ORASSO TEMP
WIRELESS TEMP
ORASSO_PUBLIC TEMP
ODS TEMP
ORABPEL TEMP
BAM TEMP

USERNAME TEMPORARY_TABLESPACE
------------------------------ ------------------------------
B2B TEMP
UDDISYS TEMP
WCRSYS TEMP
DSGATEWAY TEMP
DISCOVERER5 TEMP
ORAOCA_PUBLIC TEMP
OCA TEMP
PORTAL TEMP
PORTAL_APP TEMP
PORTAL_DEMO TEMP
PORTAL_PUBLIC TEMP

USERNAME TEMPORARY_TABLESPACE
------------------------------ ------------------------------
DCM TEMP
WEBDOMAIN TEMP
RESQUEST TEMP
PREFSTORE TEMP
SPOTLIGHT TEMP
SCOTT TEMP
IP TEMP
MDDATA TEMP
DIP TEMP
DBSNMP TEMP
WKSYS TEMP

USERNAME TEMPORARY_TABLESPACE
------------------------------ ------------------------------
SYSMAN TEMP
WKPROXY TEMP
MDSYS TEMP
WMSYS TEMP
CTXSYS TEMP
ANONYMOUS TEMP
SI_INFORMTN_SCHEMA TEMP
ORDSYS TEMP
EXFSYS TEMP
DMSYS TEMP
WK_TEST TEMP

USERNAME TEMPORARY_TABLESPACE
------------------------------ ------------------------------
XDB TEMP
ORDPLUGINS TEMP
OLAPSYS TEMP
MGMT_VIEW TEMP
SYS TEMP
SYSTEM TEMP
OUTLN TEMP

51 rows selected.

**roadwarriorDBA** · 02-04-2006, 03:39 PM

TOP process is oracleasdb ( sid is asdb )

Under cd /proc/6199 we have

lrwx------ 1 oracle oinstall 64 Feb 4 12:32 0 -> /dev/null
lrwx------ 1 oracle oinstall 64 Feb 4 12:32 1 -> /dev/null
lrwx------ 1 oracle oinstall 64 Feb 4 12:32 10 -> /oradata/asdb/ptli dx.dbf
lrwx------ 1 oracle oinstall 64 Feb 4 12:32 11 -> /oradata/asdb/port al.dbf
lrwx------ 1 oracle oinstall 64 Feb 4 12:32 12 -> /oradata/asdb/temp 01.dbf
lrwx------ 1 oracle oinstall 64 Feb 4 12:32 13 -> /oradata/asdb/ptld oc.dbf
lrwx------ 1 oracle oinstall 64 Feb 4 12:32 2 -> /dev/null
lrwx------ 1 oracle oinstall 64 Feb 4 12:32 27 -> socket:[10326829]
lr-x------ 1 oracle oinstall 64 Feb 4 12:32 3 -> /dev/null
lr-x------ 1 oracle oinstall 64 Feb 4 12:32 4 -> /dev/null
lr-x------ 1 oracle oinstall 64 Feb 4 12:32 5 -> /dev/null
lr-x------ 1 oracle oinstall 64 Feb 4 12:32 6 -> /dev/null
lr-x------ 1 oracle oinstall 64 Feb 4 12:32 7 -> /dev/zero
lr-x------ 1 oracle oinstall 64 Feb 4 12:32 8 -> /u01/app/oracle/pro duct/10.1.2.0.2/infra/rdbms/mesg/oraus.msb
lrwx------ 1 oracle oinstall 64 Feb 4 12:32 9 -> /oradata/asdb/undot bs01.dbf

iostat shows 900k/s second written and then this process dies and another replaces it and the song remains the same. This is generating the extra redo and I/O wait.

cciss/c0d1 43.21 83.44 900.20 2445474 26383248
cciss/c0d1p1 43.21 83.43 900.20 2445210 26383248

Portal is doing the activity.

**marist89** · 02-04-2006, 11:59 PM

oracleasdb

than that's the process that's doing all the work. I'd run logminer on the archived redo logs and see who's doing what...

**roadwarriorDBA** · 02-06-2006, 09:30 AM

The phantom load ( twin 99% IO WAIT on 2 cpus ) that returned immediately after restore is still a mystery. It concluded after 3 hours with archive logmode turned off.

It likely had to do with this nasty problem.

HUGE PROBLEM IDENTIFIED:

Code:

SUM(BYTES) SEGMENT_NAME
---------- --------------
1539309568 WWCTX_SSOS_IX2
 134414336 WWCTX_SSOS_PK
1213202432 WWCTX_SSO_SESSION$-->JOB NOT RUNNING TO CLEANUP TABLE
  64028672 WWSTO_SESSION$
 491847680 WWSTO_SESSION_DATA$

RUNNING THE CLEANUP JOB IS NOT GOING TO WORK IT TOOK 1 HOUR TO DELETE JUST 1000 BLOCKS AND IT HAS 3 MILLION ROWS IN 140,000 BLOCKS.

Contacting Oracle to see if they have a script for dropping recreating this session table as delete is taking too long so I killed it.

The Portal session cleanup job not running is a common problem from what I am reading.

Thread: Odd Database issue with sudden I/O wait

Thread Tools

Display

Odd Database issue with sudden I/O wait

Posting Permissions