index clustering - Page 2
DBAsupport.com Forums - Powered by vBulletin
Page 2 of 3 FirstFirst 123 LastLast
Results 11 to 20 of 30

Thread: index clustering

  1. #11
    Join Date
    Jun 2000
    Location
    Madrid, Spain
    Posts
    7,448
    Not sure if I would care about clustering factor in a DWH.

    In a DWH mostly we do full table scans AKA full partition scans so why bother with index.... Plus you can only ensure a good index clustering factor of many indexes in a table and itīs normally PK. If your queries are not going to use PK very often I wouldnt bother to look clustering factor at all.

  2. #12
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    Originally posted by pando
    Not sure if I would care about clustering factor in a DWH.

    In a DWH mostly we do full table scans AKA full partition scans so why bother with index.... Plus you can only ensure a good index clustering factor of many indexes in a table and itīs normally PK. If your queries are not going to use PK very often I wouldnt bother to look clustering factor at all.
    Well, I'm just going to have to go ahead and disagree -- maybe in your particuklar DWH you mostly do full scans, but maybe that is also beacuse you don't implement physical row ordering that would make index-based access more efficient. Does that count as Catch-22?

    Anyway, I wasn't sure what you meant by "you can only ensure a good index clustering factor of many indexes in a table" ... if you have a single column that is very often a filtering condition in user reports then it's generally worth ordering by that single column.

    I don't think that whether this column is part of the PK or not is relevant ... personally I rarely place PK constraints on DWH fact tables (certainly not enabled ones, anyway), because ...

    • There is rarely a need to reference the table with an FK
    • The ETL process can enfore the integrity
    • The index would rarely be of any value
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

  3. #13
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    Originally posted by abhaysk
    well by "flush and fill" i meant .. truncate data and load afresh for every batch..

    if loading into the table is not of that kinda, then for every 'batch inserts' the keys are going to get loosely packed.. so during a course of time it makes no sense, as your clsuter factor is going to go up for every Batch Inserts..
    True, but it's not going to go up by very much, because the column values are still clustered, they're just not all in one single cluster. As long as each load cycle has enough rows that each value of the clustered column spans a reasonable number of blocks (say, a thousand or so) then having 100 individual clusters of 1000 blocks is going to show up as almost the same clustering factor as having a single cluster of 100,000 blocks.
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

  4. #14
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    Originally posted by abhaysk
    But is it worth, to achive small benifits, do an ordering which may cause bottleneck in your Temp TS.
    "Benchmark it", is the answer ... I've seen 10x performance increases with this technique.
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

  5. #15
    Join Date
    Jun 2000
    Location
    Madrid, Spain
    Posts
    7,448
    Well you need the FK if we want to use query rewrite efficiently with MVs

    The DWH I have seen mainly do aggreates, groupping so most of the times are reading full partitions and star joins

  6. #16
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    Originally posted by pando
    Well you need the FK if we want to use query rewrite efficiently with MVs

    The DWH I have seen mainly do aggreates, groupping so most of the times are reading full partitions and star joins
    Yup, but that's an FK from the fact table to a dimension table ... requires no PK on the fact table to do so of course.
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

  7. #17
    Join Date
    Dec 2002
    Location
    Bangalore ( India )
    Posts
    2,434
    Originally posted by slimdave
    True, but it's not going to go up by very much, because the column values are still clustered, they're just not all in one single cluster. As long as each load cycle has enough rows that each value of the clustered column spans a reasonable number of blocks (say, a thousand or so) then having 100 individual clusters of 1000 blocks is going to show up as almost the same clustering factor as having a single cluster of 100,000 blocks.
    I still mantain, will that significantly change the plan.. ? (depends but most often not)..

    More over LIO are going to remain same irrespective of the clustering factor.. yes your PIO will increase but not that significantly..
    ( please also see the PIO may increase only for operations like 'Table access by Index Range Scans' )..

    Let us also not forget original poster said he was ordering data by PK columns.. so there is no way for this clustered column to span over N number of blocks ( a reasonable fig that you specified 1000 and that to for a particular batch laod ).. Or i got you wrong here as to what exactly you meant.

    For the clustering factor to remain more or less to its initial value, can think of only one stiuation, where first column of the clustered column is a incremental value generated by some sequence or so.. and that for every batch inserts the data is ordered by this sequence..

    But if its a char/varchar (or even number but not a sequence generated one) type then i dont see any point..

    Lets do a simple test how this clustering factor is getting affected per batch load.. ( a simple simulation with few blocks per batch .. )

    Code:
    DBA> select * from dba_ind_columns where table_name = 'REPORTING_ORG_HIERARCHY'
      2  and index_name like 'PK_%';
    
    INDEX_NAME                     TABLE_NAME                     COLUMN_NAME               COLUMN_POSITION
    ------------------------------ ------------------------------ -------------------- --------------------
    PK_REPORTING_ORG_HIERARCHY     REPORTING_ORG_HIERARCHY        ORG_HIERARCHY_TYPE                      1
    PK_REPORTING_ORG_HIERARCHY     REPORTING_ORG_HIERARCHY        ORG_CD                                  2
    PK_REPORTING_ORG_HIERARCHY     REPORTING_ORG_HIERARCHY        ORG_TYPE                                3
    
    DBA> create table test_clustering_factor tablespace wwciw_tab_1 pctfree 0 pctused 99
      2  storage(initial 10M next 10M pctincrease 0) as select * from REPORTING_ORG_HIERARCHY where rownum < 1;
    
    Table created.
    
    DBA> create index ak_test_clustering_factor on test_clustering_factor(ORG_HIERARCHY_TYPE, ORG_CD, ORG_TYPE) 
      2  tablespace wwciw_idx_1 pctfree 5 storage(initial 5M next 5M  pctincrease 0);
    
    Index created.
    
    DBA> insert /*+ append */ into test_clustering_factor select * from REPORTING_ORG_HIERARCHY order by
      2  ORG_HIERARCHY_TYPE, ORG_CD, ORG_TYPE;
    
    9339 rows created.
    
    DBA> analyze table test_clustering_factor compute statistics for table for all indexes;
    
    Table analyzed.
    
    DBA> select table_name , index_name, LEAF_BLOCKS, DISTINCT_KEYS, CLUSTERING_FACTOR 
      2  from dba_indexes where table_name = upper('test_clustering_factor')
      3  /
    
    TABLE_NAME                     INDEX_NAME                               LEAF_BLOCKS        DISTINCT_KEYS    CLUSTERING_FACTOR
    ------------------------------ ------------------------------  -------------------- -------------------- --------------------
    TEST_CLUSTERING_FACTOR         AK_TEST_CLUSTERING_FACTOR                         34                 9339                  183
    
    DBA> insert /*+ append */ into test_clustering_factor select * from REPORTING_ORG_HIERARCHY order by
      2  ORG_HIERARCHY_TYPE, ORG_CD, ORG_TYPE;
    
    9339 rows created.
    
    DBA> analyze table test_clustering_factor compute statistics for table for all indexes;
    
    Table analyzed.
    
    DBA> select table_name , index_name, LEAF_BLOCKS, DISTINCT_KEYS, CLUSTERING_FACTOR 
      2  from dba_indexes where table_name = upper('test_clustering_factor')
      3  /
    
    TABLE_NAME                     INDEX_NAME                              LEAF_BLOCKS        DISTINCT_KEYS    CLUSTERING_FACTOR
    ------------------------------ ------------------------------ -------------------- -------------------- --------------------
    TEST_CLUSTERING_FACTOR         AK_TEST_CLUSTERING_FACTOR                       101                 9339                18678
    
    DBA> insert /*+ append */ into test_clustering_factor select * from REPORTING_ORG_HIERARCHY order by
      2  ORG_HIERARCHY_TYPE, ORG_CD, ORG_TYPE;
    
    9339 rows created.
    
    DBA> analyze table test_clustering_factor compute statistics for table for all indexes;
    
    Table analyzed.
    
    DBA> select table_name , index_name, LEAF_BLOCKS, DISTINCT_KEYS, CLUSTERING_FACTOR 
      2  from dba_indexes where table_name = upper('test_clustering_factor')
      3  /
    
    TABLE_NAME                     INDEX_NAME                              LEAF_BLOCKS        DISTINCT_KEYS    CLUSTERING_FACTOR
    ------------------------------ ------------------------------ -------------------- -------------------- --------------------
    TEST_CLUSTERING_FACTOR         AK_TEST_CLUSTERING_FACTOR                       135                 9339                28017
    
    DBA> create table dummy(dummmy_id number) tablespace base_tab_1 storage (initial 16K next 16K);
    
    Table created.
    
    DBA> insert into dummy values(1);
    
    1 row created.
    
    DBA> /
    
    1 row created.
    
    DBA> /
    
    1 row created.
    
    -- Now lets insert recs in one go (that was done in 3 batches above) and check the clustering factor..
    
    DBA> truncate table test_clustering_factor;
    
    Table truncated.
    
    DBA> insert /*+ append */ into test_clustering_factor select r.* from REPORTING_ORG_HIERARCHY r, dummy
      2  order by ORG_HIERARCHY_TYPE, ORG_CD, ORG_TYPE;
    
    28017 rows created.
    
    DBA> analyze table test_clustering_factor compute statistics for table for all indexes;
    
    Table analyzed.
    
    DBA> ed
    Wrote file afiedt.buf
    
      1  select table_name , index_name, LEAF_BLOCKS, DISTINCT_KEYS, CLUSTERING_FACTOR 
      2  from dba_indexes where table_name = upper('test_clustering_factor')
    DBA> /
    
    TABLE_NAME                     INDEX_NAME                              LEAF_BLOCKS        DISTINCT_KEYS    CLUSTERING_FACTOR
    ------------------------------ ------------------------------ -------------------- -------------------- --------------------
    TEST_CLUSTERING_FACTOR         AK_TEST_CLUSTERING_FACTOR                       102                 9339                  550
    
    DBA> select table_name, num_rows, blocks from dba_tables where table_name = upper('test_clustering_factor');
    
    TABLE_NAME                                 NUM_ROWS               BLOCKS
    ------------------------------ -------------------- --------------------
    TEST_CLUSTERING_FACTOR                        28017                  550
    
    -- Interestingly you can see here the minimum possible clustering factor is equal to distinct blocks indentified in
    index entries.. Which is obivious as the table data is also ordered.
    Abhay.
    Last edited by abhaysk; 07-01-2004 at 07:20 AM.
    funky...

    "I Dont Want To Follow A Path, I would Rather Go Where There Is No Path And Leave A Trail."

    "Ego is the worst thing many have, try to overcome it & you will be the best, if not good, person on this earth"

  8. #18
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    There is indeed an issue here with the PK -- the clustering factor for a PK index will be much higher (equal to num rows seems not at all unlikely) for this batch load type. Pretty much the same applies to indexes where the index is close-to unique, as in your case.

    However I'd argue that this test case isn't relevant in the context of a DWH because there are two cases to consider.

    • When querying on the full set of index columns, you only retrieve very few rows Iin your example, three rows). Oracle will use the index to retrieve the rows, and the difference between reading one and reading three table blocks is practically insignificant.
    • When querying on a limited set of the indexed columns, say ORG_HIERARCHY_TYPE and ORG_CD, the index does not indicate the clustering factor for those columns alone and therefore the technique ishamstrung by lack of information.


    What would be more typical is that you would have three single-column indexes (bitmap type, possibly) on the three columns. The clustering factor would be indicated on a column-by-column basis, and the optimizer would have more information to go on.

    If you still have your test case handy, try placing single column indexes on the two leading columns, and get clustering factors for them individually. It might be more demonstrative of a real DWH structure -- except for one of pando's index-less abominations
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

  9. #19
    Join Date
    Dec 2002
    Location
    Bangalore ( India )
    Posts
    2,434
    Originally posted by slimdave
    There is indeed an issue here with the PK -- the clustering factor for a PK index will be much higher (equal to num rows seems not at all unlikely) for this batch load type. Pretty much the same applies to indexes where the index is close-to unique, as in your case.
    Not true if first col is Sequence Generated or of that sort..

    Originally posted by slimdave
    However I'd argue that this test case isn't relevant in the context of a DWH because there are two cases to consider.

    • When querying on the full set of index columns, you only retrieve very few rows Iin your example, three rows). Oracle will use the index to retrieve the rows, and the difference between reading one and reading three table blocks is practically insignificant.
    • When querying on a limited set of the indexed columns, say ORG_HIERARCHY_TYPE and ORG_CD, the index does not indicate the clustering factor for those columns alone and therefore the technique ishamstrung by lack of information.
    For your quote 1.. LIO will still remain same !! ( ok PIO may increase by 2 as you said ).. But in cases of real data particluarly where Char datatypes are involved you really cannot expect your CF to remain same for such Batch Operations..

    Well if you still think deeper and know your application very well as to how data is queried regularly than you can decide whether its good to order your data..

    For your quote 2.. lets try with yet another test case..

    Originally posted by slimdave
    What would be more typical is that you would have three single-column indexes (bitmap type, possibly) on the three columns. The clustering factor would be indicated on a column-by-column basis, and the optimizer would have more information to go on.

    If you still have your test case handy, try placing single column indexes on the two leading columns, and get clustering factors for them individually. It might be more demonstrative of a real DWH structure -- except for one of pando's index-less abominations
    Code:
    DBA> ed
    Wrote file afiedt.buf
    
      1  create table test_clustering_factor tablespace wwciw_tab_1 pctfree 0 pctused 99
      2* storage(initial 10M next 10M pctincrease 0) as select * from where rownum < 1
    DBA> /
    
    Table created.
    
    DBA> create index ak1_test_clustering_factor on test_clustering_factor(ORG_HIERARCHY_TYPE)
      2  tablespace wwciw_idx_1 pctfree 5 storage(initial 5M next 5M  pctincrease 0);
    
    Index created.
    
    DBA> ed
    Wrote file afiedt.buf
    
      1  create index ak2_test_clustering_factor on test_clustering_factor(ORG_CD)
      2* tablespace wwciw_idx_1 pctfree 5 storage(initial 5M next 5M  pctincrease 0)
    DBA> /
    
    Index created.
    
    DBA> ed
    Wrote file afiedt.buf
    
      1  insert /*+ append */ into test_clustering_factor select * from REPORTING_ORG_HIERARCHY order by
      2* ORG_HIERARCHY_TYPE, ORG_CD, ORG_TYPE
    DBA> /
    
    9339 rows created.
    
    DBA> analyze table test_clustering_factor compute statistics for table for all indexes;
    
    Table analyzed.
    
    DBA> ed
    Wrote file afiedt.buf
    
      1  select table_name , index_name, LEAF_BLOCKS, DISTINCT_KEYS, CLUSTERING_FACTOR
      2* from dba_indexes where table_name = upper('test_clustering_factor')
    DBA> /
    
    TABLE_NAME                     INDEX_NAME                              LEAF_BLOCKS        DISTINCT_KEYS    CLUSTERING_FACTOR
    ------------------------------ ------------------------------ -------------------- -------------------- --------------------
    TEST_CLUSTERING_FACTOR         AK1_TEST_CLUSTERING_FACTOR                       19                    5                  183
    TEST_CLUSTERING_FACTOR         AK2_TEST_CLUSTERING_FACTOR                       23                 8567                 1283
    
    DBA> select table_name, num_rows, blocks from dba_tables where table_name = upper('test_clustering_factor');
    
    TABLE_NAME                                 NUM_ROWS               BLOCKS
    ------------------------------ -------------------- --------------------
    TEST_CLUSTERING_FACTOR                         9339                  183
    
    DBA> insert /*+ append */ into test_clustering_factor select * from REPORTING_ORG_HIERARCHY order by
      2  ORG_HIERARCHY_TYPE, ORG_CD, ORG_TYPE;
    
    9339 rows created.
    
    DBA> analyze table test_clustering_factor compute statistics for table for all indexes;
    
    Table analyzed.
    
    DBA> select table_name , index_name, LEAF_BLOCKS, DISTINCT_KEYS, CLUSTERING_FACTOR
      2  from dba_indexes where table_name = upper('test_clustering_factor');
    
    TABLE_NAME                     INDEX_NAME                              LEAF_BLOCKS        DISTINCT_KEYS    CLUSTERING_FACTOR
    ------------------------------ ------------------------------ -------------------- -------------------- --------------------
    TEST_CLUSTERING_FACTOR         AK1_TEST_CLUSTERING_FACTOR                       56                    5                  374
    TEST_CLUSTERING_FACTOR         AK2_TEST_CLUSTERING_FACTOR                       63                 8567                18272
    
    DBA> select table_name, num_rows, blocks from dba_tables where table_name = upper('test_clustering_factor');
    
    TABLE_NAME                                 NUM_ROWS               BLOCKS
    ------------------------------ -------------------- --------------------
    TEST_CLUSTERING_FACTOR                        18678                  366
    
    -- Even after second batch you are not seeing significant increase in CF for AK1 index only because ORG_HIERARCHY_TYPE 
    have very few a values.. But do check the other index AK2 its increased drastically which is obvious that it is in 2nd postion
    of that order list.. 
    
    DBA> select distinct ORG_HIERARCHY_TYPE from test_clustering_factor;
    
    ORG_
    ----
    APSL
    FORC
    LPIV
    PHIV
    PLSL
    
    DBA> truncate table test_clustering_factor;
    
    Table truncated.
    
    -- Lets check with data ordered by Org_Cd first and then check.. ( PS distinct values of ORG_CD is very high )
    
    DBA> insert /*+ append */ into test_clustering_factor select * from REPORTING_ORG_HIERARCHY order by
      2  ORG_CD, ORG_HIERARCHY_TYPE, ORG_TYPE;
    
    9339 rows created.
    
    DBA> analyze table test_clustering_factor compute statistics for table for all indexes;
    
    Table analyzed.
    
    DBA> select table_name , index_name, LEAF_BLOCKS, DISTINCT_KEYS, CLUSTERING_FACTOR
      2  from dba_indexes where table_name = upper('test_clustering_factor');
    
    TABLE_NAME                     INDEX_NAME                              LEAF_BLOCKS        DISTINCT_KEYS    CLUSTERING_FACTOR
    ------------------------------ ------------------------------ -------------------- -------------------- --------------------
    TEST_CLUSTERING_FACTOR         AK1_TEST_CLUSTERING_FACTOR                       19                    5                  345
    TEST_CLUSTERING_FACTOR         AK2_TEST_CLUSTERING_FACTOR                       23                 8567                  184
    
    DBA> select table_name, num_rows, blocks from dba_tables where table_name = upper('test_clustering_factor');
    
    TABLE_NAME                                 NUM_ROWS               BLOCKS
    ------------------------------ -------------------- --------------------
    TEST_CLUSTERING_FACTOR                         9339                  184
    
    DBA> insert /*+ append */ into test_clustering_factor select * from REPORTING_ORG_HIERARCHY order by
      2  ORG_CD, ORG_HIERARCHY_TYPE, ORG_TYPE;
    
    9339 rows created.
    
    DBA> analyze table test_clustering_factor compute statistics for table for all indexes;
    
    Table analyzed.
    
    DBA> select table_name , index_name, LEAF_BLOCKS, DISTINCT_KEYS, CLUSTERING_FACTOR
      2  from dba_indexes where table_name = upper('test_clustering_factor');
    
    TABLE_NAME                     INDEX_NAME                              LEAF_BLOCKS        DISTINCT_KEYS    CLUSTERING_FACTOR
    ------------------------------ ------------------------------ -------------------- -------------------- --------------------
    TEST_CLUSTERING_FACTOR         AK1_TEST_CLUSTERING_FACTOR                       56                    5                  690
    TEST_CLUSTERING_FACTOR         AK2_TEST_CLUSTERING_FACTOR                       63                 8567                17160
    
    DBA> select table_name, num_rows, blocks from dba_tables where table_name = upper('test_clustering_factor');
    
    TABLE_NAME                                 NUM_ROWS               BLOCKS
    ------------------------------ -------------------- --------------------
    TEST_CLUSTERING_FACTOR                        18678                  368
    
    -- Note that drastic increase in the CF for AK2 even after ordering data by ORG_CD ...
    funky...

    "I Dont Want To Follow A Path, I would Rather Go Where There Is No Path And Leave A Trail."

    "Ego is the worst thing many have, try to overcome it & you will be the best, if not good, person on this earth"

  10. #20
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    [Not true if first col is Sequence Generated or of that sort..
    Of course, that is the case ... as column values approach "distinctness", the benefits are reduced. If there are on average only three occurances of each value in the table then the maximum number of blocks over which that value can be spread is three.

    Let's suppose that your table has a total of 1,000,000 rows stored in 12,500 blocks, giving 80 rows per block. You have the option of physically ordering by a column that has 360 distinct values -- lets call it COL360 -- giving about 2,778 rows per value.

    Now if the values of COL360 are perfectly unclustered, then each value will be spread over about 2,778 blocks. If you want to retrieve all rows for a single value of COL360 then you are definately in FTS (or FPS) land. This is our worst-case scenario for potential PIO's.

    If the table were physically ordered by COL360 then on average each distinct value will be contained in 35 blocks (2778/80) -- hello index range scan!

    What we are looking for here is PIO change, and we've got it. This is probably close to the perfect case for physical row ordering.

    Well if you still think deeper and know your application very well as to how data is queried regularly than you can decide whether its good to order your data..
    yes, i think that would be essential -- there would be no point in ordering by a column that is not either heavily predicated in user queries, or that has a cardinality that is too high or too low for there to be a difference.

    In the example, it looks like there may be too many distinct values of ORG_CD for it to be worthwhile -- likewise there may be too few distinct values of ORG_HIERARCHY_TYPE.

    The technique's value degrades as the number of distinct values approaches extremes for different reasons ...
    • As distinct values approaches "1", the best possible clustering will still leave each value spread over so many blocks that a full scan will be more efficient that index-based access
    • As distinct values approaches number of rows then the number of blocks over which each value can be spread in an unclustered table is still snmall enough that index access can be used (although there are potential benefits in PIO reduction when the data is clustered)


    Allow me to propose some theories here ...
    • The greatest benefit, in terms of the ratio of reduction in blocks over which each value is spread, is achieved when the number of distinct values is between the number of rows per block and the number of blocks per table
    • The best achievable reduction ratio in blocks-per-value is roughly equal to the number of rows per block
    • If it is more efficient to full scan a table than to index access a table in order to retrieve "1/x" of the blocks, then it is only worthwhile ordering on a column that has more than x distinct values


    Of course, these theories all apply only to single column ordering, where thevalues for that column are equi-distributed. The presence of skew in the values will improve the potential benefits of physical row ordering.
    Last edited by slimdave; 07-01-2004 at 12:24 PM.
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  



Click Here to Expand Forum to Full Width