Synopsis. Oracle DBAs have long wished to be able
to capture applications’ workloads in a production environment and then replay
that workload against a test environment to determine the impact of proposed
database or application changes on database performance. Oracle Database
11gR1’s new Database Replay feature enables an Oracle DBA to capture, massage,
and then replay workloads either selectively or in their entirety across a wide
range of database environments and platforms. This article provides a primer
for using Oracle 11g Database Replay to effectively predict the performance of
rapidly changing applications within increasingly fluid database environments.
If my time served in information technology has taught me
anything, it’s the utter truth behind Murphy’s Law: “Anything that can go
wrong will go wrong.” Over the years, I’ve also learned the veracity of
several corollaries to Murphy’s Law, including the fact that “interchangeable
parts won’t” and especially the rejoinder that “Murphy was actually an
optimist.” I hope to add my own corollary to these grim testimonials one
day with the following observation obtained at great pain: “Nothing works in
production like it does in the test environment.”
All attempts at ill humor aside, Oracle DBAs face a severe
challenge on a regular basis: to predict accurately how the next set of
changes to the database’s application code, database patch set, or hardware
configuration will affect negatively the entire database environment.
When I say the entire environment, I do mean that literally: every SQL
statement that any running application has executed, regardless if it’s
just a simple query or a complex batch job that issues a huge volume of DML
statements, must be captured. This challenge has become even more acute
because today’s typical application workloads are generated across a landscape
that spans myriad technologies: n-tier application servers, web farms, and even
client-server legacy applications. Moreover, when an application appears to be
performing poorly, the root cause of a detected performance issue is often
nearly impossible to track down; it may be the indirect result of an incorrect
network configuration, the application server’s configuration, or even the
application client’s environment variable settings.
The only chance to accomplish this lofty yet crucial goal is
via an application workload capture-and-playback product suite. This
type of application is specifically designed to capture the complete
workload that’s been executed against the current production database
environment (P+0) and then play back precisely that same workload
against the next iteration of the production environment (P+1). However,
my past experience has shown this means asking upper management to open the
company checkbook to purchase a relatively expensive third-party solution (e.g.
HP’s LoadRunner tool). The cost of licensing this software, the cost of
procuring and configuring the servers it runs upon, and most especially the
manpower expense to configure and prove out the capture-and-playback suite even
before workloads can be captured and replayed may easily run into the
six-figure USD range. And that’s why many IT organizations have surrendered to
the perception that it’s simply impossible to test for application system
regression in a cost-effective manner.
Beyond Performance Regression
My prior article series about Oracle Database 11g’s new SQL
statement performance tuning features – SQL
Performance Analyzer (SPA) and SQL
Plan Management (SPM) –
discussed how Oracle 11g allows the DBA to easily isolate SQL statements whose
performance has either progressed, remained unchanged, or regressed
after the application environment has been changed. Any robust capture and
playback suite must certainly capture and compare the differences between the original
(P+0) versus the future (P+1) system, application, and database
performance statistics, especially SQL statements that are now performing
dramatically worse. However, there are two other types of regression that any
robust capture-and-playback suite needs to address as well:
Error Regression. When a captured workload is being
played back, it’s not unusual to encounter errors, and in fact this is almost
certainly a desirable outcome. For example, I’d like to verify that an expected
exception like the violation of referential integrity (i.e. violation of
a primary key, foreign key, unique key, CHECK, or NOT NULL constraint) actually
arises and is trapped correctly. Also, I’d like to insure that an appropriate
and expected exception is raised whenever a crucial business rule is
violated. For example, if during payroll check processing an employee’s net
salary doesn’t balance to gross pay less all deductions, I’d like that
exception to be trapped as an error.
Therefore, any robust capture and playback suite must be
capable of monitoring three types of error regression:
-
Have all expected errors occurred?
-
Next, have any error conditions been raised that were not
expected? This obviously indicates that severe error regression is possible
because of system or application changes.
-
Finally, have any expected errors not occurred?
This condition is much more subtle, because it indicates that something
sinister has changed within the system or application, and crucial business
rules are either being misapplied or not applied at all.
Data Regression. Any robust capture and playback
suite also must pay strict attention to any variance in the data itself after
the playback has completed. For example, if I’m testing a mission-critical
financial system, I must insure that the identical financial transactions have
completed in the appropriate order and all accounts total to the same balances
in the P+1 environment as they would in the P+0 environment. If I don’t receive
identical results, I must assume that something in my application, database, or
environment has changed to prevent a perfect playback.
Another crucial feature of the capture and playback suite:
It must be able to insure that the captured workload is played back against the
P+1 environment only after that environment has been reset to reflect the state
of the P+0 environment at the time workload capture was initiated.
Otherwise, there’s a chance that “false positive” indications of data
regression will be detected that have nothing to do with changes to the
application, database, or environment.
The metaphor of a player piano roll especially appeals to me
when I imagine how a workload’s individual, dependent transactions need to be
captured and then played back: Once the pianist has finished playing a piece of
music, the piano roll has not only captured the exact notes performed,
but also the exact frequency at which individual keys were struck. In
essence, it gave the listener an exact duplicate that encapsulates the
maestro’s playing style, including all of her performance’s subtle pauses.
(Humorous aside to readers younger than 25: If you’ve never seen a player piano
roll, please substitute “MP3” or even “WAV file” … or ask one of your older
co-workers to explain how things were “back in the old days.”)
Database Replay: Features Summary
Thankfully, Oracle 11g’s new Database Replay (DBR)
suite offers all of the features I’ve just described. DBR permits an Oracle DBA
to:
-
Capture a workload that’s being generated on a production
database system. This includes the ability to capture concurrent execution of
the same SQL statements across multiple sessions while simultaneously gathering
all dependent transactions.
-
“Massage” the captured workload before it’s
eventually executed against a representative test system. This allows the DBA
to adjust the frequency at which the workload is played back, as well as
the capability to remap connections to different user sessions,
different services, or – in the case of playback on an Oracle 11g Real
Application Clusters (RAC) test system – one or more database instances.
-
Replay the captured and “massaged” workload on a test
system. The test system is configured to conform to the P+1 configuration so
that the DBA can accurately determine exactly how the workload will react to
any proposed system changes, including application changes, software
patches, and even hardware upgrades. The target for the test system
can be a test or QA database environment, or even a snapshot standby
database. (More on this last item in later articles.)
-
Perform regression analysis to highlight any differences
that exist between the P+0 and P+1 simulated workload. DBR will automatically
identify and analyze the vectors of any error regression, data
regression, or SQL statement regression.
The beauty of Database Replay is that it eliminates the
necessity to fashion simulated workloads to perform regression analysis.
Instead, the DBA can execute precisely the same SQL statements that have been
recorded, so this tends to provide a more accurate picture of system regression
because other extraneous factors (e.g. network latency) are reduced or
eliminated. Best of all, since all the recorded SQL statements that
comprise the workload are played back, there’s virtually no chance that even
seemingly insignificant or rarely executed code will be ignored as a possible
vector for regression. This can be crucial in determining if an application
will scale properly in a Real Application Clusters (RAC) clustered database
environment.
The next four sections of this article provide a high level
primer on how Database Replay features interact to accomplish their common
goal: an accurate determination of just how much regression can be expected
when migrating a production system from P+0 to P+1. In the next articles in this series, I’ll demonstrate how to utilize Database Replay to capture,
preprocess, replay, and analyze results from a DBR operation.
Phase 1: Recording a Workload
Oracle 11g Enterprise Manager Database Control provides an
intuitive interface to Database Replay features that enables capture, massage,
replay, and regression analysis of a workload for
performance, error, and data regression. The interface also provides excellent
feedback on the status of each phase of a Database Replay task set. Figure 1
below shows the initial screen that’s displayed when a Database Replay
operation is initiated by selecting the Database Replay
option from the Real Application Testing section of EM’s
Software
and Support tab:

Figure 1: Database Replay Master Panel.
During this phase of a Database Replay operation, the
complete workload as seen from the perspective of the production database is captured
and recorded. The DBA simply needs to insure that there is a sufficiently
“interesting” (i.e. pertinent) workload running on the production
system; DBR does the rest by capturing all executing SQL statements
issued from all external clients. This includes:
-
SQL queries, DML statements, and DDL statements
-
PL/SQL blocks and Remote Procedure Calls (RPCs)
-
Object Navigation requests and OCI calls
Note that while
the DBR capture operation is running, Oracle 11g doesn’t halt any running
background jobs; any internal clients continue making requests as well.
DBR records the
workload via a series of shadow processes. These shadow processes then
filter the necessary information to precisely reproduce the system workload,
and finally write this metadata into a series of XML files – one set of files
for each process -- for processing and playback at a later time. About the only
concern an Oracle 11g DBA might have, is whether there is sufficient space on
the file system for the numerous XML files that are written during this
process.
Phase 2: “Massaging” the Workload
Once a DBR workload has been recorded, it almost certainly
will need some slight adjustments before it’s replayed. For example, it may be
necessary to remap certain external client connections so they can be replayed
accurately in the P+1 environment. During this phase, DBR prepares
workload-specific metadata for its eventual replay, and any parameters that
might affect the outcome of the replay can be modified at this stage.
This preprocessing must occur on the same database version
as the replay, but as long as the database versions match, the scrubbing
operation can be performed on a production, test, or other database system. In
fact, Oracle highly recommends performing this metadata “massaging“ on a host
other than the production server so as not to affect the health or performance
of that server.
Phase 3: Replaying the Workload
Now that the workload has been massaged, it’s time to
initiate its replay. Once the metadata scrubbing described above is
done, designated DBR replay client(s) can replay the workload as often as
analyses are required.
Resetting the Test Environment. Before any replays
are initiated, however, the Oracle DBA must first “reset” the target database
and host environment being used for testing, because it’s vital that the test
server matches the production server before any changes are applied; otherwise,
unexpected regression may occur (a.k.a. a false positive). Fortunately,
this is much simpler to perform with the advent of FLASHBACK
DATABASE features in Oracle 10g. Other alternatives include normal
point-in-time incomplete recovery via RMAN, or even DataPump Export and Import.
Once the testing environment has been reset properly, the Oracle DBA next
applies all proposed changes to the production system on the test server so
that it is now in state P+1, and then she transfers the previously captured
workload to the P+1 server.
Replaying
the Workload via the Replay Driver. When it’s finally time to replay the
previously captured workload against the P+1 server, a special application
called the Replay Driver begins sending requests to the target RDBMS.
Since the Replay Driver is client-agnostic, it makes no difference to
Oracle 11g as to what type of client was sending the requests initially.
The Replay Driver consumes the recorded workload and sends appropriate requests
to the P+1 system so that it behaves as if external clients were actually
issuing the requests.
Since it will
distribute all workload capture streams between all replay
clients, the Replay Driver can take into account network bandwidth, CPU, and
memory capacity. The Replay Driver can also take advantage of remapped
connection strings so that they correlate one-to-one (i.e.
single-instance to single-instance) or many-to-one (i.e. single node to
several Real Application Cluster nodes), and that means that connection load
balancing can be taken into account. Just as important, the Replay driver will
ignore any activity that was originally generated by internal clients (e.g. EM
Database Control) and simply not replay that activity. It will also ignore any
recorded activity that utilized connections to external databases via database
links or accessed directory objects.
One other
intriguing advantage of using Database Replay: A captured workload can be
played back in either synchronous or asynchronous replay mode. In synchronous
mode, each transaction is replayed in exactly the order that it was recorded;
however, DBR can also replay a workload back in asynchronous mode, i.e.
without regard to transaction synchronicity, so that a heavier-than-recorded
workload can be generated. This is an especially useful feature when attempting
to perform a “test to destruction” of a new or modified database environment.
DBR Workload Replay Scope. Oracle 11gR1 Database
Replay can accurately evaluate the following types of changes to a database
environment during the workload replay phase:
-
Database upgrades
-
Database patches
-
Changes to database schemas
-
Changes to initialization parameters
-
Modifications to one or more Real Application Cluster (RAC)
nodes as well as their interconnect configurations
-
Upgrades to the operating system via OS patches and upgrades
-
OS platform modifications, including transitioning from 32-bit
to 64-bit environments
-
Changes to server memory or CPU configurations
-
Modifications to the database’s storage configurations,
including migration of database files between a named file system (e.g.
EXT3, NTFS), ASM storage, and/or raw storage
DBR Workload
Replay Limits. Database Replay’s simulation abilities do have some
noteworthy (and justifiable) limitations:
-
SQL*Loader direct path loading cannot be replayed;
however, conventional path SQL*Loader operations are indeed replayable.
-
Import and export operations, whether via traditional or
DataPump methods, are not replayable.
-
Oracle Shared Server sessions cannot be traced.
-
Flashback Database recoveries and Flashback Query operations are
not replayable.
-
Oracle Streams, including Advanced Queuing (AQ)
operations that aren’t PL/SQL based, can’t be replayed.
-
Distributed transaction processing, including remote COMMIT operations, is only replayable as local
transactions.
-
Oracle Call Interface (OCI)-based object navigation isn’t replayable.
For the most part, these limitations do make sense. For
example, a Flashback Database operation is essentially an incomplete recovery
of the database, so it’s not part of normal transaction processing, and I
certainly wouldn’t be concerned about whether its performance has regressed.
While the limitation against Shared Server sessions does make sense, there are
still a few database shops out there that do utilize Shared Server for
connection pooling, so this is a minor disappointment.
Phase 4: Regression Analysis
After the
workload replay is completed, Database Replay provides several different
analyses of how the replayed workload’s performance varied in the P+1
environment versus its original performance in the P+0 environment. As I
discussed at the beginning of this article, any good regression testing suite
has the capability to trap and analyze performance regression, data regression,
and error regression, and DBR definitely doesn’t disappoint on these features.
For example,
DBR can detect immediately any performance differences at a summary level via
its set of Capture and Replay reports. From these reports, it’s
possible to drill down into detailed analysis of database performance stored
within Automatic Database Diagnostic Monitor (ADDM) runs, Automatic
Workload Repository (AWR) reports, and Active Session History (ASH)
reports.
Whatever the source of the divergence, DBR post-replay
analysis can identify and handle two different flavors of divergence:
-
Online divergence may indicate that the database replay
has badly malfunctioned, and should probably be halted, as the results of the
replay are possibly less than meaningful.
-
Offline divergence is actually an expected result
of a successful database replay operation; this type of divergence is typically
detected and measured after the replay is completed.
Next Steps
Enough theory! Any tool this elegant deserves a
significant evaluation, so in the next article in this series, I’ll demonstrate
how to:
-
Capture a simple workload on a single-instance Oracle
11gR1 database
-
Massage (i.e. pre-process) that workload for eventual
replay
-
Replay the massaged workload on a two-node Oracle 11gR1
Real Application Cluster (RAC) clustered database
-
Identify possible issues that might arise during
transition to a similar target environment
References and Additional Reading
While I’m hopeful that I’ve given you a thorough grounding
in the technical aspects of the features I’ve discussed in this article, I’m
also sure that there may be better documentation available since it’s been
published. I therefore strongly suggest that you take a close look at the
corresponding Oracle documentation on these features to obtain crystal-clear
understanding before attempting to implement them in a production environment.
Please note that I’ve drawn upon the following Oracle Database 11gR1
documentation for the deeper technical details of this article:
B28274-01 Oracle Database
11gR1 Performance Tuning Guide
B28279-02 Oracle Database
11gR1 New Features Guide
Next
Back to DBAsupport.com