Assume that you are executing the SQL statement in SQLPLUS.
1 Oracle reads first 10M rows and then places them into buffer cache. And sends the data to User Workstation through NET8 (TCP/IP)
2 Now, Oracle reads 2nd 10M rows that must be placed in the buffer cache, but all the buffer blocks are filled with the 1st set of 10M rows. So it flushes (aged ) out the old buffer blocks and places the new data (2nd 10M rows) into the buffer cache. And sends the data to user's workstation.
This process goes on until it completes the fetching the rows from disk and sends out to workstation.
So it needs 10 reads logically to send 100M rows, but physically Oracle reads the data based on the I/O size, and DB_FILE_MULTIBLOCK_READ_COUNT. But remember, the max I/O size supported in one physical read cannot exceed 64K.
There are couple of things here that have to be clarified.
1. This is not directly connected to your question, but the idea is present in some of the messages in this thread: "Full table scan (FTS) of big table would fill the buffer cache, thus flushing out blocks from other tables completely". This is not true, because during full table scan buffers are read into the "least recently used" (LRU) end of the buffer cache, meaning that they will be the first to be flushed out of the buffer cashe. Let's assume we have buffer cashe of 1000 blocks, 100 of which are free. When FTS begin, first 100 blocks are read into the empty buffers in the cache and they are all put in the LRU end of the free list. When next set of blocks (number is defined by MULTIPLE_BLOCK_READ_COUNT)) have to be read, Oracle must flush the same number of buffers from the cache and it chooses those from the LRO end of the list. Incidentaly, this are the same blocks that have where read previously in the same FTS operation. So in effect, FTS cycles through a limited number of bufers in the cache, thus preventin a single FTS to flood the whole buffer cache.
2. The optimizer mode (RULE, CHOOSE, FIRST_ROWS, ALL_ROWS) has absolutely no effect wheter the rows will be returned to user as soon as each of them is found or when all the rows are processed. This is determined by the type of a particular operation - "row operations" will return row by row as soon as rows are found, while "statement operations" cuse resulting rows to be returned as a whole result set at the end of the operation, once all the rows are processed. Typical examples of "statement operations" are INTERSECT, MINUS, UNION, all kind of SORT operations, HASH JOIN, MERGE JOIN.... Typical examples of "row operations" are TABLE/INDEX SCANs, TABLE BY ROWID, FILTER, NESTED LOOP, AND-EQUAL, CONCATENATE.... So when FTS is executed (if it is a simpleFTS, without ORDER BY or being a part of HASH/MERGE JOIN), each row is returned to the user as soon as it is found, regardles of optimizer mode. But, of course, by choosing diferent optimizer mode you can dramatically influence on the decision whether FTS will be used or not.
3. To go directly to the answer of your question:
a) When there is no "statement operation" during the execution of the query (as FTS in your example, or INDEX_SCAN+TABLE_BY_ROWID or simmilar), as soon as single block or MULTIPLE_BLOCK_READ_COUNT set of blocks are read into the cache, they are examined for the rows that met all the conditions of the query. Those rows are returned immediately to the users. In the mean time, user proces reads another set of blocks into the cache, removing previously read blocks from the cache if necessarry. This goes on until all the table blocks are processed.
b) When there is a "statement operation" involved in a querry (for example, if you use ORDER BY in your query), oracle must perform a sorting (if you check explain plans of queries using any kind of statement operations like MINUS, MERGE JOIN, SORT ORDER BY, SORT AGGREGATE, SORT UNIQUE etc, you'll find out there is allways a sorting operation involved). So in this case, when the block is read into the cache, the resulting rows are found and put into the sort area of the shared pool, where they are sorted. This goes on untill all blocks involved are processed. If during this process sort area is filled up with the resulted rows, they are written into the temporary tablespace segment. After all the blocks have been processed, the final sort of the temporary segment is done (if necessary) and the whole result set is returned to the user.
So no matter how many blocks have to be read and how large the result set is, the resulting rows are returned to the user either immediately row-by-row, or at the end of the operation from the sorting area (assisted by temporary tablespace) as a complete resulting set.
Jurij Modic ASCII a stupid question, get a stupid ANSI
24 hours in a day .... 24 beer in a case .... coincidence?
If you're using Oracle Client , then there is swapping of disk space - Oracle will write the fetched rows onto disk and when all rows are retrieved, the result is thrown on to the screen/user. I don't know how it is if you're directly on the server.
This is my suggestion since I've seen a lot of swapping when dealing with big queries.
Kindly let me know if I am any where near to the answer.
Well, this has nothing to do with buffer cache as a part of a SGA.
This "client swapping" depends on what kind of client application/tool you use. For example, if you use SQL*Plus you won't notice any swapping on client machine, because all the recived rows are immediately shown on the screen and screen scrolls automaticaly when new rows arive.
But if you use some kind of tool that presents returned rows in some kind of "data grid" on the display, only first few rows are displayed at the screen, while others have to be "buffered" somewhere in the memory in case user chooses to scroll up/down the screen to see the rows that are currently not visible. If the amount of the returned rows is huge, client machine have to swap the memory to disk.
Jurij Modic ASCII a stupid question, get a stupid ANSI
24 hours in a day .... 24 beer in a case .... coincidence?
I am asking this coz the other day I was doing a massive select * from a huge table and I notice that after seeing certain number of rows in sqlplus the speed of returning rows slows down considerably (guess in this part buffer was full and Oracle was reading from disk again to fill the buffer?) then speed up again (in buffer now?) :D
Anyway I thought everyone would share this "unofficial paper" from Oracle, received from one of Oracle analyst through e-mail, as usual with these papers Oracle will not be responsible of consequences applying criteria from this paper
Here is a "unofficial" copy of a note (104937.1)--not sure if it will be
made external in the near future....
This note explains the new (Oracle 8i) algorithm for managing buffers in
the Oracle buffer cache.
Scope and application
This note is for INTERNAL use only and is not for general release.
Replacement algorithm for buffer cache in 8.1.3
The new replacement algorithm for managing buffers in the cache is no
longer LRU based. Instead, it is a variation of the "touch count", or
"clock frequency" scheme. It is described below assuming that all
are linked on a single list (different from implementation, which has
several lists, for keeping track of buffers that are absolutely useless,
those that need to be written, etc).
Basic description assuming all buffers on a single list
Buffer's have a "touch count" field that keeps track of the number of
touches (hits) a DBA has encountered while it is in the cache. Hits that
are very "close" (within _db_aging_touch_time seconds) are counted as 1
You can see the current number of touches per buffer by dumping the
You might alternatively query x$bh as it has been updated to show the
SQL> SELECT TCH FROM SYS.X$BH WHERE FILE#=4 AND DBABLK=2794;
1 row selected
What happens when a buffer is touched (hit) in cache?
On a hit, the following steps occur:
- Assuming that _db_aging_touch_time seconds has passed since we last
incremented the touch count, the touch count is increased by 1.
- The buffer is NOT moved from its current position in the list. That is
it stays where it is.
- Incrementing the touch count is done without any latching activity.
Thus, we may miss an increment to touch count occasionally as a
So basically, we increment touch count based on the time elapsed since
we last incremented it.
How is a victim selected for replacement?
If we need to read a buffer into the cache, we must first identify a
"victim" to be replaced:
- Victims are selected by scanning the list from the tail of the list.
IF ( touch count of scanned buffer > _db_aging_hot_criteria ) THEN
Give buffer another chance (do not select as a victim)
IF (_db_aging_stay_count >= _db_aging_hot_criteria) THEN
Halve the buffer's touch count
Set the buffer's touch count to _db_aging_stay_count
Select buffer as a victim
Where is a new buffer placed in the list?
Unlike LRU, where a buffer with a new DBA is always moved to the top of
the list (except for long table scans), with this scheme a buffer is
inserted in the "middle" of the cache. "Middle" is specified as a
percentage of the list, and is set for the 3 buffer pools using the
To understand this, think of the list being divided up into a "hot"
portion (buffers above the "middle" point) and "cold" buffers (those
that are below the middle point). Here, "hot" and "cold" are very loose
terms since it is possible for buffers with a very high touch count to
trickle down to the cold region, if they are not touched since their
series of touches.
Thus, a buffer read into the default cache will (by default) be
in the middle of the cache (_db_percent_hot_default = 50 by default) .
The reasoning for putting the buffer in the middle, as opposed to the
top, is to make this buffer earn its touch count by getting a few hits
before getting a chance to go to the top of the cache.
Actual Implementation Details
The actual implementation is a variation of the algorithm to account
for maintaining buffers to be written in separate lists (regular
writes versus ping writes versus writes due to reuse range and reuse
Also, once buffers are written, they are usually "useless", in the sense
that they have aged out. Such buffers are kept aside on a AUXiliary list
rather than the main list, provided there are no waiters (or
waiting for this buffer. If there are foregrounds that want to access
these buffers, then they are moved to the "middle" portion of the main
after re-setting their touch count.
The victim selection described earlier always prefers a buffer in the
auxiliary list over a buffer on the main list. If there are no buffers
on the auxiliary list, then it uses the algorithm described earlier for
the buffers on the main list.
Also, there is some special processing for CR buffers, which allows
buffers to be "cooled", and put at the tail of the main list after
their temperature below the "threshold" (parameters _db_aging_freeze_cr
Specific cases - full table (long) scans
From testing it appears that blocks read by a full scan of a long table
(see <Parameter:small_table_threshold>) will be placed at the end of the
list. Thus, these buffers will imediately become victims for
It was observed that during such a scan, every
db_file_multiblock_read_count's worth of blocks were placed at the end
of the list (LRU flag "moved_to_tail" bit was set) and were replaced by
the next db_file_multiblock_read_count's batch of blocks.
This effectively mimics the pre-8i behaviour for full table (long) scans
and avoids the cache being flushed by a huge table scan.
Specific cases - full table (short) scans and cached tables
The buffers are not moved to the tail of the list, and thus do not
immediate candidates for replacement.
Summary of parameters
Parameter name Default Description
------------------------------- ------- ----------------------------------
_db_aging_hot_criteria 2 Used to decide victim selection
_db_aging_stay_count 99 Touch count set to this if
(low value) low value < threshold during
_db_percent_hot_default 50 % Divides default cache into hot
(middle point) and cold regions; specifies
in the cache a new buffer is to
_db_percent_hot_keep 0 % Same as above, but for keep pool
_db_percent_hot_recycle 0 % Same as above, but for recycle
_db_aging_touch_time 3 secs Touch count not incremented if
(small interval) the buffer is touched within
_db_aging_touch_time seconds of
the last touch
"..was doing a massive select * from a huge table ... after seeing certain number of rows in sqlplus the speed of returning rows slows down considerably.... then speed up again"
This kind of fluctuations is very common in real life in general(look into airplains's properler or car's tire :-)).
Returning to your sample:
It least likely can be caused by described algorithm: the cache strategy for a full table scan for one user has no effect. The work by Oracle on LRU or whatever lists - searching of victim, etc is negligeble - those are fast in-RAM operations.
It can be because of interaction between disk HW, OS, Oracle and Client. Those are independant processes, running in parallel. Having in mind, that they process data in chunks (e.g Oracle read datafiles in buffers*DB_MULTIB...COUNT, piping data to client or sending via socket also use buffers, HW use buffers, possitions heads on disks onto right track), it could well happen that the performance graph is kind of oscillations.
For example, on WinNT you can see very nice CPU ussage picture when oracle is busy with some homegenous processing, like import/delete/ect of huge table.