How and when do alerts or
informational messages about what’s taking place inside your database make
their way out to you, the DBA par excellence? There are several ways, some free
and some hand-crafted, to expose alerts and messages. Let’s face it, as the
Oracle RDBMS engine becomes more and more Skynet-Terminator-Judgement
Day-aware, keeping track of what’s taking place inside an instance has become
easier and harder at the same time. The easier aspect of this statement is evidenced
by more sophisticated monitoring tools and interfaces, and the harder part is
borne out by the sheer number of metrics that are available to monitor.
Let’s start off with a
simple peek inside the database option.
Tail the Alert Log
A commonly used quick and
dirty monitoring tool in UNIX-based environments (AIX, HP, Solaris, and Linux)
is a simple script to tail “X” number of lines out of the alert log, and then
search (grep) the extract for whatever is of interest to you. Specific
ORA-xxxxx errors can be searched, or to make things even simpler, the search
can be based on any ORA error. If an ORA error appears, then an email is fired
off via a mail transfer agent (MTA) to one or more addresses.
The steps can be summarized
by the shell script pseudo code below:
#! /usr/bin/ksh
tail $ORACLE_HOME\bdump\alert_<SID>.log > alert.log
COUNT=`grep ORA alert.log | wc -l`
if [$COUNT is something other than zero or an empty string]
then
mail -s "Check alert log" dbaalerts@company.com < alert.log
fi
Several features need to be
in place for this scheme. First, whomever (as a person or machine user such as
oracle) is running, the script needs to have appropriate file system permissions
to be able to read $ORACLE_HOME and write to wherever.
Second, your MTA can be as
simple as “mail” (or mailx, depending on your flavor/version of UNIX). Chances
are your UNIX admin already has UNIX mail working as no doubt much of his or
her watchfulness is notification after the fact as opposed to scanning logs all
day long (which is pretty much what this is for you as well).
Third, you need something to
read mail yourself, so that implies something along the lines of
Outlook/Exchange Server in your company’s office. Assuming you have been
assimilated by the Borg, oops, I mean Microsoft, then the email address shown
in the example would stand out to those familiar with aliases or mail groups. Otherwise,
have the script “cat” a file with email addresses in it and loop through the
addresses.
Fourth, you need something
to execute the tail job on a periodic basis as you are pulling the alert log
information as opposed to it being pushed to you, and what better than a cron
job to mange this aspect of the process. The cron can run every ten minutes (as
an example) all week long. While crons are very reliable, what the job cannot
do is guarantee you that it will catch an ORA error. One way to help ensure
that your tail of 100 lines does not miss the ORA error at the 101st
line (i.e., you missed it by one line) is to grab enough lines to increase the
likelihood that the extract will contain at least the last ten minutes of alert
log activity. Better to grab too much than not enough of the alert log.
As a variation on what is
emailed to you, don’t include the entire alert log extract in the DBA alert
email. You only need a subject line telling you to inspect the alert log as
opposed to sending (and waiting) multiple KB worth of text, especially if
you’re receiving email on a PDA while on call.
Check for Required Processes
A variation (or complement)
of the alert log scan is an existence check for required processes. As a
minimum, does the script need to check for PMON, SMON, DBWn, LGWR, and CKPT?
The answer is not really – checking for PMON by itself, as an example, is
sufficient in and of itself. No PMON means no instance, which in turn means no
running database (assuming a single instance/single database pairing).
Between an alert log scan
and an instance checking “is my database up” script, the instance checking
version is more of a superset of the alert log scan. Here is why this is so: is
an alert log going to be written to if the instance is no longer running?
Or looked at this way, can
an instance still be viable if it encounters or detects an ORA error? Yes it
can, and a deadlock is an excellent example of this scenario. Deadlock
detected, trace file info is written to the alert log, one session’s
transaction is essentially cancelled, and life goes on because absolutely
nothing is wrong with the database. Remember, Oracle’s philosophy on deadlocks
is that when they do occur, it is because of something you caused via code, not
something that is a shortcoming or error within Oracle.
Knocking down an instance by
killing a required process typically generates alert log information, and can
be easily demonstrated. On Windows, use the orakill utility to kill a SPID
associated with a SID (kill -9 PID counterpart in UNIX). Use a query like the
one below to obtain a SPID.
select c.name, b.spid, a.sid
from v$session a, v$process b, v$bgprocess c
where c.paddr <> '00'
and c.paddr = b.addr
and b.addr = a.paddr;
NAME SPID SID
----- ------------ ----------
PMON 288 170
MMAN 536 168
DBW0 2596 167
LGWR 3936 166
CKPT 3252 165
SMON 3400 164
RECO 2432 163
We’ll use 288 (for PMON) as
one of the parameters for orakill.
The alert log then records
information about instance failure, and you can see the ripple effect among the
trace files related to other processes (not all alert entries are shown)..
Tue May 22 01:46:24 2007
LGWR: terminating instance due to error 472
Tue May 22 01:46:25 2007
Errors in file c:\oracle\product\10.2.0\admin\db10\bdump\db10_ckpt_3252.trc:
ORA-00472: PMON process terminated with error
Tue May 22 01:46:26 2007
Errors in file c:\oracle\product\10.2.0\admin\db10\bdump\db10_dbw0_2596.trc:
ORA-00472: PMON process terminated with error
Tue May 22 01:46:31 2007
Errors in file c:\oracle\product\10.2.0\admin\db10\bdump\db10_reco_2432.trc:
ORA-00472: PMON process terminated with error
Tue May 22 01:46:31 2007
Errors in file c:\oracle\product\10.2.0\admin\db10\bdump\db10_smon_3400.trc:
ORA-00472: PMON process terminated with error
Instance terminated by LGWR, pid = 3936
Going beyond alert logs and background processes
We can get much more
information about what’s going on inside a database with the DBMS_SERVER_ALERT
built-in PL/SQL package. In fact, more than 140 metrics are available, and the
alert threshold values for many of these can be adjusted to suit your
particular needs.
One alert or metric you may find
to be useful involves the detection of blocking, the “silent” show stopper of
Oracle. Blocking can go on for hours and hours with no discernible or
externally noticeable signs of it taking place. Blocking is usually detected
when users start to complain about hung sessions, followed by calls about not being
able to log in, and when scripted jobs fail to complete (noticed by you or
others). Aside from manually detecting blocking, wouldn’t it be nice to be
alerted when Oracle detects a blocking situation? In Oracle 10g, we can do
exactly that.
One of the configurable
metrics is for blocked user sessions, and it comes with its own graph. The
“Metric Value” picture below is a result of the competing update statements
shown in the SQL*Plus session windows (with an output of the blocking info
below that).
Blocking is really quite
insidious, and user sessions in an OLTP database can stack up in no time at
all. From a customer service perspective, you can be certain your company would
hate to have customers dissatisfied with your Web site that manages personal
account information, mailing/shipping preferences, and any number of service
oriented functionality. With server managed alerts, you can be one of the first
to know about this situation as opposed to being practically the last to know.
In Closing
In the next article about
serving up server alerts, we’ll go into detail about two ways to configure and
manage server alert/metric settings: using the DBMS_SERVER_ALERT package and
its GUI counterpart in Database Control.
Back to DBAsupport.com