
Thread: question on CBO reliability

  1. #1 (Join Date: Nov 2000 | Posts: 212)
    Can anyone share practical experiences with regular statistics gathering for the CBO?

    The problem I experienced is that, because of some known CBO issues, the CBO can generate a plan that slows a query down dramatically. Given that experience, when the CBO is used I tend to create statistics once, make sure everything works, and then lock the statistics against further updates, just to be on the safe side.



  2. #2 (Join Date: Nov 2000 | Location: Baltimore, MD USA | Posts: 1,339)
    I'd be interested in what you call 'known' CBO issues, just to be sure they are real. For example, if you have an index on a field that is populated free-form by users, then you have an issue. Take a free-form date field (the usual suspect): if a thousand people have entered dates between last year and this year, you will have this statistical profile for the field:

    LOW_VALUE: 1/18/2000
    HIGH_VALUE: 1/18/2001

    When given WHERE DATE > '1/1/2001', the optimizer will say 'Oh, they want (1/18/2001 - 1/1/2001) / (1/18/2001 - 1/18/2000) records, or 18/365, or ~5% of the data. Great! I'll use that index.'

    However, the next day someone enters a date of 1/18/3001. When you next update the stats, you end up with:

    LOW_VALUE: 1/18/2000
    HIGH_VALUE: 1/18/3001

    Now the optimizer looks at the same query, WHERE DATE > '1/1/2001', and says 'Oh, they want (1/18/3001 - 1/1/2001) / (1/18/3001 - 1/18/2000) records, or 365259/365625, or ~99.9% of the data. Guess I'll do a table scan.'
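    If you want to see the values the optimizer is working from, they live in the dictionary. A quick sketch, assuming a hypothetical analyzed table ORDERS with a date column DATE_COL:

    [code]
    -- Column statistics the CBO uses for range selectivity.
    -- LOW_VALUE/HIGH_VALUE are stored in an internal RAW format,
    -- so they need decoding before they are human-readable.
    SELECT column_name, num_distinct, low_value, high_value
    FROM   user_tab_columns
    WHERE  table_name  = 'ORDERS'
    AND    column_name = 'DATE_COL';
    [/code]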

    This is, in my opinion, the most dangerous drawback of the CBO. Other than that, I love it!!!! Yes, there are some risks with constantly changing your stats, but they are small. My next biggest issue is with the ESTIMATE option. What a mistake that is, IMHO, especially since it was flawed prior to 8.1.6.2. So if you coded to those stats and then switched versions, you were in for a surprise. Always use COMPUTE if possible.
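    For reference, the difference is just this (table name hypothetical):

    [code]
    -- Full computation: reads the whole table, exact statistics.
    ANALYZE TABLE orders COMPUTE STATISTICS;

    -- Sampled estimate: faster, but this is the sampling that was
    -- flawed prior to 8.1.6.2.
    ANALYZE TABLE orders ESTIMATE STATISTICS SAMPLE 20 PERCENT;
    [/code]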

    The frequency of stats updates is then up to you. It depends on how dynamic your data is. If you do *not* do regular stat updates on a table that is extremely dynamic, you are actually at greater risk: the optimizer will never know that the data is changing and will continue to use the same plans, many of which may begin to slow down slightly or significantly. Sooner or later one of those queries 'crosses the line', goes out to lunch, and never comes back. You want to keep your optimizer well fed with proper statistics, but don't do it blindly. Save off the stats before and after the analyze process and compare them. If any HIGH_VALUE or LOW_VALUE has deviated more than expected, look into it, and be ready to re-apply the old stats if things go bad.
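    One way to do that save-and-restore dance is DBMS_STATS with a user stat table. A rough sketch, with all names (schema, table, stat table) made up:

    [code]
    -- Before re-analyzing: snapshot the current stats.
    BEGIN
      DBMS_STATS.CREATE_STAT_TABLE(ownname => 'SCOTT', stattab => 'STATS_BACKUP');
      DBMS_STATS.EXPORT_TABLE_STATS(ownname => 'SCOTT', tabname => 'ORDERS',
                                    stattab => 'STATS_BACKUP');
    END;
    /

    -- ...re-analyze and compare; if the new plans go bad, restore:
    BEGIN
      DBMS_STATS.IMPORT_TABLE_STATS(ownname => 'SCOTT', tabname => 'ORDERS',
                                    stattab => 'STATS_BACKUP');
    END;
    /
    [/code]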

    Anyway, I'm done rambling now.

    HTH,

    - Chris

  3. #3 (Join Date: Nov 2000 | Posts: 212)
    Thank you, Chrisrlong.
    The 'known' issue is what you said about dates, and not only that: if data is not evenly distributed (e.g., index key A1 has the value 'X' 10,000 times and 'Y' only twice), then a query with a bind variable, WHERE A1 = :b1, will yield a full scan.
    The only solutions I know are (rough sketches after the list):
    1) play with hints (+FIRST_ROWS, sometimes INDEX)
    2) use histograms (which, after all, are not so easy to maintain) and go for literals. But prior to 8.1.6, literals mean a separate entry in the shared pool and long parse times (I think parse time is still longer for literals in 8.1.6 even with CURSOR_SHARED=FORCE). I think this is a solution for a DSS environment, but not for OLTP: normally I do not want Oracle to reparse a SQL statement and build a new plan every time.
    3) clear the data so that the 'true' distribution is built, create statistics, generate the plan, and fix it (do not change statistics afterwards, or use outlines...)
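    To show what I mean by 1) and 2), roughly (table, column, and index names hypothetical):

    [code]
    -- 1) Force the index with a hint:
    SELECT /*+ INDEX(t t_a1_idx) */ * FROM t WHERE a1 = :b1;

    -- 2) Build a histogram on the skewed column so the CBO can see the
    --    skew (8i ANALYZE syntax; SIZE is the number of buckets):
    ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS a1 SIZE 75;
    [/code]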

    So I agree that the CBO is good for DSS, where query response times are long enough to ignore reparsing and building a new plan.
    But for OLTP this is not the case.
    I wonder how best to deal with mixed environments: OLTP and DSS in the same DB?
    Is it a normal solution to use fixed outlines for the OLTP application and 'free' mode for the DSS application?

  4. #4 (Join Date: Nov 2000 | Location: Baltimore, MD USA | Posts: 1,339)
    Yes, the next biggest issue is highly skewed distribution of data. When you know that large percentages of certain values dominate a table, you will probably want to go the histogram route and use constants. I'm interested in why you say that maintaining histograms is difficult. As for the parsing time of statements, I've always considered that the lowest of my tuning priorities.

    In general, there usually aren't that many tables whose data is that highly skewed. When it happens, you have the following situation:

    In your example, value 'X' occupies 99.9% of the data. When querying on value 'X', a table scan is in order. One would assume that since this value occupies a majority of the data, it will also be used in a majority of the queries; people will rarely query on 'Y'. Therefore, if you were to 'pin' a plan for the query, you would probably pin the table scan anyway. Even having the index would be useless overhead. So what are the solutions?

    - Use a constant, index and histogram
    Now you will have twice as many queries that could get loaded into the SGA: one set for X and one set for Y (leave all the other values as binds). Once loaded, the X queries will likely stay in the SGA, since they are accessed 99+% of the time, saving on any recompile issues. The Y queries will likely get aged out, incurring the (minor) parsing overhead when re-loaded. However, when re-introduced and parsed, they will use the index, thus saving a massive table scan, so they win out anyway. Since most Y queries are quickly aged out, you haven't actually doubled your SGA requirements, just increased them a little. (See the sketch after these three options.)

    - Stay with binds and drop the index and histogram
    Since 'Y' represents such a small percentage of the whole, are you really *that* worried about the performance hit on the Y queries? Don't forget that the simple existence of the index will hurt DML on that table. Of course, you could always leave the index in the hope that someday the percentages change.

    - Force the index all the time - no need for a histogram or constants
    This will hurt the 'X' queries, but saves on histogram maintenance, parsing, and SGA size. Because of the importance of 'X', this may not be the best route.
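    To make the first option concrete, the query pattern would look something like this (names purely illustrative): the skewed column gets a literal, everything else stays bound.

    [code]
    -- Dominant value: its plan (the table scan) stays cached in the SGA.
    SELECT order_id FROM t WHERE a1 = 'X' AND order_date > :d1;

    -- Rare value: a separate cursor that may age out, but when re-parsed
    -- the histogram steers it to the index, avoiding the big scan.
    SELECT order_id FROM t WHERE a1 = 'Y' AND order_date > :d1;
    [/code]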

    Now, I know this is a simplistic example, but compare it to forced plans. Which plan would you force? Forcing plans eliminates choice 1 and only leaves you with *always* scanning or *always* using the index, both of which can be done without forcing the plan. So basically, forcing the plan, or using the RBO, can *not* get you any better performance than the CBO can by itself. Furthermore, even if you choose solution 2 and leave the index in place knowing that it will not be used, you have at least left your options open. If the statistics of the table change in the future, the CBO can adapt to those changes automatically. This is the key, especially as the CBO gets better and better with each release. It takes care of the job of maintaining the proper performance of all your queries.

    In my opinion, this is the strongest argument for the CBO. Yes, there are issues. You have to know its weaknesses and you have to know how to feed it. But this is no different from hand-optimizing every statement and storing the plans. There are risks that changes in stats will cause problems. However, if you hard-code plans and the stats change, you are far more likely to have performance issues than if you let the CBO do its job.

    Yes, the SGA is an issue. You have to balance your number of unique statements against your performance needs. I'm interested in your CURSOR_SHARED=FORCE point also - I couldn't find any references for that one.

    Again, I view parsing time as a minor priority. By this I mean that you should use binds and re-use statements as much as possible - *until* it interferes with other performance considerations. The difference between saving a parse and saving a table scan can be massive; I'll choose re-parsing every time. But again, recognize that the queries for the larger values will stay in the SGA and not need to be re-parsed anyway, so this is truly a minor issue.

    As for mixing OLTP and DSS in the same database - always a bad idea. They have radically different needs, designs, usage patterns, security requirements, etc. Everything is different, and mixing them is generally something to be avoided.

    Finally, I'm not sure about your third option. It would seem that either the data is valid or it isn't. How would you "clear data so a 'true' distribution is built"?

    Hope this all made sense,

    - Chris

    PS - I'd love to hear anyone else's comments on the subject as well.

  5. #5 (Join Date: Nov 2000 | Posts: 212)
    > As for the parsing time of statements, I've always considered that as the lowest of my tuning priorities.

    This indicates that we are engaged in building somewhat different systems: my current job is all about performance (i.e., the system has the ugly requirement to refresh data on client screens whenever some data changes in the DB), so parsing time simply cannot be ignored.

    However, I agree that for DSS this is not an issue.
    But for the system I have to deal with, plan stability for at least some (currently almost all) queries is the top priority; otherwise this particular system will go down.
    I am also interested in using hints to enforce plan stability: what about different Oracle versions? For example, Oracle keeps threatening to desupport the RULE hint.

    > CURSOR_SHARED
    Sorry, it is CURSOR_SHARING. See the Tuning guide or Reference for 8.1.6.
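    For reference, it is the 8.1.6 parameter that rewrites literals into system-generated binds at parse time. For illustration only:

    [code]
    -- Can be set per session, or instance-wide in init.ora:
    ALTER SESSION SET CURSOR_SHARING = FORCE;
    [/code]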



    > Maintaining histograms:
    As far as I understand, histograms cannot be exported/imported with imp/exp? Or can they?

    > 3rd option, feeding in 'correct' statistics:
    I mean that when plan stability is the priority, you can feed the development database with 'correct' data and hope the CBO generates reasonable plans. Then fix those plans with outlines.
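    Roughly like this, I mean (outline and category names made up):

    [code]
    -- Capture the plan the CBO generated against the 'correct' data:
    CREATE OR REPLACE OUTLINE orders_by_a1
      FOR CATEGORY oltp_plans
      ON SELECT * FROM t WHERE a1 = :b1;

    -- Then have the application's sessions use the stored plans:
    ALTER SESSION SET USE_STORED_OUTLINES = oltp_plans;
    [/code]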

  6. #6 (Join Date: Nov 2000 | Location: Baltimore, MD USA | Posts: 1,339)
    You misunderstood me. Tuning everything is, of course, important. I meant that in the long list of variables that affect performance, parsing is the lowest, because it is relatively fast. Look again at the scenario I presented; it was not presented as black-and-white. *Everything* is a matter of give and take. All things considered, recompilation is one of the smallest amounts of time your queries will spend; I/O is *much* more important. Therefore, if you have to decide, as I tried to show, between a query that always table-scans and one that can sometimes use an index if it has to recompile, choose the recompile option. Recompiling takes a *very* small amount of time in comparison to most database operations.

    Now, no matter how you create the original statement, the statement should still be in the SGA, and the plan fixed, by the time you have to refresh the user. Therefore, it does not matter how you create the statement the first time. You can use dynamic SQL and Oracle will still find the query the second time and still use the same stored plan.

    Given these two facts, I really don't see why you are so hung up on 'locking down' the plans. What brought you to the opinion that you had to do this? I still stick to my original example: you only have so many choices, and if you decide to force a plan, you have to decide which plan, and using *either* plan *all* the time has drawbacks. Check my example again.

    CURSOR_SHARING - Yeah, I thought I'd read about this before. It just doesn't seem like a good idea to me, but I've never implemented it. Perhaps others have differing opinions. You may want to start a different thread on that question, since it is unlikely that anyone other than us is reading this thread anymore :). Note that it is no different than using bind variables anyway. Therefore, it is a better idea to use binds where you know the data is evenly distributed and constants where it is skewed. Using CURSOR_SHARING=FORCE simply removes your second option.

    Histograms - As I'm light on the admin stuff, I don't know this one. It sounds unfortunate if true, but certainly not insurmountable, and a little extra admin work is worth it if it saves you from your performance nightmare, no? I know you can export and import the regular stats by themselves. Check out this link to see if the histograms go along for the ride or not:
    [url]http://technet.oracle.com/doc/server.815/a68001/dbms_sta.htm#1008494[/url]


    Still don't understand the last issue - I'm being thick :). Regardless: no matter what you do in development to come up with 'better' plans, you are still trying to 'fix' those plans in place. Why? Again, refer to my example. You are hoping that there is some magic plan that will be best for all of your data, but when your data is severely skewed, that just won't happen. Look again at my example: you will either be using the index or not, for *every* value, if you 'fix' the plans. Neither choice is perfect *for all data sets*. This is where the CBO shines. And again, when you search on Y values, a table scan will take wayyyy longer than recompiling the statement so you can use the index.

    I hope I've made myself a little clearer this time. Mind you, if all your data were evenly distributed and your stats were not changing much over time, I'd be all for your idea: tune it all, come up with the best set of plans for everything, and then fix them in place to eke out that last bit of performance. But with highly skewed data, one plan will not fit all. Again, if you don't care about the smaller sets, then fix the plans for the larger sets if you want, as long as you understand the consequences.

    - Chris


