I wrote a query with join and same one with the subquery.
When I did explain plan it surprised me that the subquery was faster than the join. I was expecting it other way.
Does any one know why the subquery ran faster than the join query.
I had index on AuthAttr_Res_ID ,AuthAttr_Work_ID
SQL> Select a.Res_ID
2 from mwebRes a, MwebAuthAttrib b, mwebwork c
3 where a.Res_Status <> 20
AND a.Res_Entity_Type = 3
4 AND a.Res_ID =b.AuthAttr_Res_ID
5 AND b.AuthAttr_Work_ID =c.Work_ID
6 AND c.Work_Entity_Type = 4 AND c.Work_Status <> 40 group by a.res_id ;
SQL> Select Res_ID
2 from mwebRes
3 where Res_Status <> 20 AND Res_Entity_Type = 3
4 AND Res_ID IN (Select AuthAttr_Res_ID
5 From MwebAuthAttrib
6 where AuthAttr_Work_ID IN
7 (Select Work_ID
8 From MwebWork
9 where Work_Entity_Type = 4 AND Work_Status <> 40));
It would help if you could provide the execution plans for both statements.
The main benefit for a subquery is to 'flatten' M:1 relations. For example, if I want some information about employees that worked on a given project, and this information is in a bunch of tables, it will probably be better to join all those tables in the main query and just do a WHERE EMP_ID IN ( SELECT DISTINCT EMP_ID FROM TIMECARD WHERE PROJECT_CD = :PROJECT_CD). This is because the timecard table is going to have *tons* of entries for each employee that used that project code. If I join that table through the whole query, then my entire resultset size will be hundreds of times larger than it has to be up until the point of doing some sum or distinct or group by at the outer-most level. If I move the DISTINCT (or sum or whatever) into the inner-most level, then I may save a lot of time building the rest of the resultset.
With an IN, even without the DISTINCT, you will end up with a smaller resultset when you're done. Since you are doing a group by at the end of the first query, I will assume that this is the case. You should probably add the DISTINCT keyword in whichever of the INs is actually returning duplicate IDs (or both). This will make the functionality more obvious and also improve performance, since it will not need to search on the same ID twice for inclusion.
This has confused me, what I observed is the query with IN subquery (without group by, distinct ) was faster than the join with group by or distinct when I was expecting reverse.
As you said I should have distinct inside IN for subquery for making it more fast ( thanks for this tip).
But that was not the point the subquery without the distinct is also faster than the join. I have always read that one should use join and not the subquery whenever possible so I was thinking that join would be faster !!!
BTW Tamilselvan> I am writing this on the basis of EXPLAIN PLAN.
- Re-read my reply. I explained why the sub-select was faster. It has to do with the flattening of the M:1 relationship. When this is a factor that can be isolated to a sub-select, then the sub-select will be faster. When this is not a factor, then a join will give the optimizer more options.
- I mentioned the DISTINCT as a matter of clarification and optimization of the IN query, *not* in the context of comparison with the first query, since I had already explained that one.
- Please provide the explain plans when asking to compare statements - makes life much easier :)
I will try, however, to more clearly explain the benefits of a sub-select versus a join. Relize that in *most* situations, the join is better. However, when you can flatten M:1 relationships in the sub-select, then the sub-select is better.
So I have:
EMP_T with 10,000 records
TIMECARD_T with 1,000,000 records
EMPDEMO_T with 10,000 records. This is a 1:1 with EMP_T holding demographic info (Sorry, but can't think of a better example at the moment)
Now, if I want all the employees (along with their, uh, Hair Color) that have ever worked on a given project, I have 2 (for purpose of the example) ways to do this:
-T.PROJECT_CD = :PROJECT_CD AND
-E.EMP_ID = T.EMP_ID AND
-ED.EMP_ID = T.EMP_ID AND
-ED.EMP_ID = E.EMP_ID
Now, the optimizer is going to start with T because that has the initial restriction. Let's look at the records that satisfy the query in the various tables:
T: 10,000 records for emps that woked on that project
E: 1000 emps worked on that project
ED: 1000 emps with hair color info :)
Let's look at what the optimizer does with this query:
T: Pulls 10,000 records
Joins in E (or ED)'s 1000 records, by going into table 10,000 times and ends up with a resultset size of 10,000
Joins in ED (or E)'s 1000 records by going into table 10,000 times and ends up with resultset size of 10,000
Does the GROUP BY to cut 10,000 records back down to 1,000
Now, what is the other choice:
-E.EMP_ID IN (
-----T.PROJECT_CD = :PROJECT_CD ) AND
-ED.EMP_ID = E.EMP_ID
Now, what does the optimizer do:
T: Pulls 10,000 records
Does a DISTINCT to cuts that down to 1,000 EMP_IDs
Goes into E 1,000 times and ends up with a resultset size of 1,000
Joins in ED by going into the table 1,000 times and ending with a resultset size of 1,000
Hopefully, you can see why the sub-select query is faster in this case. The huge table, the one with the M:1 relationship - TIMECARD, had the M:1 relationship isolated and 'flattened' in the sub-select, saving us a lot of extra work in the rest of the query.
Again, remember that this is not a common occurence. You generally want to avoid sub-selects because they 'hand-cuff' the optimizer by reducing the choices it has for solving the query. However, keep an eye out for this type of query because this is when the sub-select should be used and really shines.