Indexing only DISTINCT values

**JChiappa** · 08-15-2007, 03:56 PM

In a 9.2.0.7 EE database, I have a big table(tens of millions rows), the table´s PK is composite (ie, field1 + field2 + DATE_PROCESS + field4). DATE_PROCESS have many thousands of possible values, and there is a lot of repetitions - say, some hundreds of records with DATE_PROCESS 01-JAN-2007, some hundreds with 02-JAN-2007...
So far, so good, BUT I will have a new requirement, respond to queries like MAX(DATE_PROCESS), TOP-n DATE_PROCESS, etc. Of course, today DATE_PROCESS is "inside" the index, so the best what I can do is a index scan, not acceptable for performance - the answer, obviously, is create a new index with DATE_PROCESS only, ok, ** BUT ** in this case the index will be enormous, as I said I have a lot of repetitions for each DATE_PROCESS, and this is an overhead to my case, logically I need an index of ONLY the distinct values of DATE_PROCESS.... This is my question, therefore : how can I do something like that, maybe some trick with FBIs ? Of course, I know that I can compress index keys, compacting the repeated values, OR I can make a MV with distinct-only values and index that, BUT all these option implies in more space than the bare minimum, logically speaking.... IF I could have an index built only with the desired, distinct values, without repeating the data in MVs or alike, I would be a happy camper...

regards,

Chiappa

**PAVB** · 08-15-2007, 04:16 PM

Originally Posted by JChiappa

IF I could have an index built only with the desired, distinct values

think a little... how is going Oracle to point to rows that are not in your "magic" index but you want them to be pointed at?

Question... how's cardinality for your DATE_PROCESS column?... low cardinality might lead you to try bitmap index and have all the fun in the world with it.

**tamilselvan** · 08-15-2007, 04:19 PM

My understanding of your problem is:
Table Data
========
ROWID DATEPROCESS
---------------------
ROW-1 10-AUG-2007
ROW-2 10-AUG-2007
ROW-3 11-AUG-2007
ROW-4 12-AUG-2007
ROW-5 13-AUG-2007
ROW-6 13-AUG-2007

ROW-1 and ROW-2 have the same date and ROW-5 and ROW-6 have the same date.

You want to create an index that has ROW-3 and ROW-4 only.

AM I RIGHT?

**JChiappa** · 08-15-2007, 08:18 PM

PAVB, think about a Function Based b*tree index, in it you can choose what rows goes to the index and what do not, just returning a NULL for the non-desired rows, nulls are not indexed, we know . What I want is something like that, I would choose to index only one of the records with DT_PROCESS=10-AUG-2007 (any of them) , only one of the records with DT_PROCESS=11-AUG-2007, and so on. This kind of index would serve to find the max/top n withou repetition DT_PROCESS quickly, and using the smallest area possible (the repeated values does not matters, to this query).
Tamil, no : with your sample data, what I want is an index with one key entry to '10-AUG-2007' (could be row #1 or #2, does not matters, BUT notto both), one entry to '11-AUG-2007', one entry to '12-AUG-2007', and one entry with '13-AUG-2007' (pointing to row #5 or to row #6, any of them BUT not both).... To the extent of my knowledge, this can´t be done (what is possible is the options what I said in the original post and derivations of them) , but anyway I´m asking just to be sure....

regards,

Chiappa

**mike9** · 08-16-2007, 01:51 AM

what is the problem of having an index on a table with ten of millions of rows?

>** BUT ** in this case the index will be enormous, as I said I have a lot of
> repetitions for each DATE_PROCESS, and this is an overhead to my case

what kind of overhead?

Did you have a look at bitmap indexes (as PAVB already said)?
Are you using table compression?
Have you considered using range partitioning?

**Bore** · 08-16-2007, 06:40 AM

I have the feeling you are trying to use index instead of MView
create mview about that and it can solve your problem

**JChiappa** · 08-16-2007, 11:08 AM

Bore, of course, as I said in the original message, for sure an indexed MV ** or ** an non-unique compressed index on DT_PROCESS solves my problem (ie, to find max values and/or top-n max values quickly via an indexed search), but none represents the bare minimum extra disk space usage : with the MV option, I will have a segment for the MV data - the distinct records -, and another to the MV index, AND using a "normal" b*tree index on DT_PROCESS , even compressing , due to the relatively large number of key repetitions , I will for sure end with more disk space consumption than a "selective" index containing only the records of interest. As I said, probably this kind of index can´t be done, I will have to use one of the options, but to be sure I´m asking, one never knows, I do not know everything about every possibility, maybe someone suggests some clever way to do ...
Mike9, no problem per se, I was just trying to evitate it, in this system (a large DW-one) disk space is at a premium, each and every "space economy" if possible/viable/not performance-impacting must be done....So the "overhead" is just more disk usage than the bare minumum, maybe "overhead" was a bad choice of words... Table compression yes, and index compression is a possibility (like I said above), but NO to partition by date (the table is already partitioned by other keys) and NO to bitmap indexes (the table suffers from relatively constant DMLs) - by the way, even if via some "clever" trick I could obtaint an "selective" index like I said, the fact of DML ocurring will be a major impediment . Imagine, using some "trick" think I have an index with only the distinct values (ie, '10-AUG-2007' (could be row #1 or #2, does not matters, BUT not both), one entry to '11-AUG-2007', one entry to '12-AUG-2007', and one entry with '13-AUG-2007'), ok. What happens when the choosed record for, say, '10-AUG-2007' is deleted in the table, how Oracle would "locate" the other record for '10-AUG-2007' and "substitute" it in the index ?? yep, in each way this "requirement" begins to show really impossible, I will need to use some of the other options....

regards,

Chiappa

**tamilselvan** · 08-16-2007, 11:22 AM

Code:

SQL> 
SQL> create table tamil (id number(6), rdate date);

Table created.

SQL> 
SQL> 
SQL> create or replace function tamil_date_fn
  2  ( p_date IN Date) return number
  3  DETERMINISTIC
  4  as
  5  PRAGMA AUTONOMOUS_TRANSACTION;
  6  p_num int;
  7  begin
  8  	begin
  9  	  select count(*) into p_num from tamil
 10  	   where tamil.rdate = p_date;
 11  	exception when no_data_found then
 12  	  p_num := 0 ;
 13  	end;
 14  	return p_num;
 15  end;
 16  /

Function created.


SQL> 
SQL> create index tamil_idx on tamil
  2  ( case when tamil_date_fn(rdate) = 0 then rdate else null end);

Index created.

SQL> 
SQL> validate index tamil_idx;

Index analyzed.

SQL> 
SQL> select NAME, LF_ROWS , DISTINCT_KEYS from index_stats;
TAMIL_IDX                               0             0                                 

1 row selected.

SQL> 
SQL> 
SQL> insert into tamil values (1, trunc(sysdate - 8));

1 row created.

SQL> commit;

Commit complete.

SQL> insert into tamil values (2, trunc(sysdate - 8));

1 row created.

SQL> commit;

Commit complete.

SQL> insert into tamil values (3, trunc(sysdate - 7));

1 row created.

SQL> commit;

Commit complete.

SQL> insert into tamil values (4, trunc(sysdate - 6));

1 row created.

SQL> commit;

Commit complete.

SQL> insert into tamil values (5, trunc(sysdate - 6));

1 row created.

SQL> commit;

Commit complete.

SQL> insert into tamil values (6, trunc(sysdate - 5));

1 row created.

SQL> commit;

Commit complete.

SQL> insert into tamil values (7, trunc(sysdate - 4));

1 row created.

SQL> commit;

Commit complete.

SQL> 
SQL> select * from tamil;
         1 2007-AUG-08 00:00:00                                                         
         2 2007-AUG-08 00:00:00                                                         
         3 2007-AUG-09 00:00:00                                                         
         4 2007-AUG-10 00:00:00                                                         
         5 2007-AUG-10 00:00:00                                                         
         6 2007-AUG-11 00:00:00                                                         
         7 2007-AUG-12 00:00:00                                                         

7 rows selected.

SQL> 
SQL> validate index tamil_idx;

Index analyzed.

SQL> 
SQL> select NAME, LF_ROWS , DISTINCT_KEYS from index_stats;
TAMIL_IDX                               5             5                                 

1 row selected.

There are 7 rows in the table, but 5 entries in the index.

Warning:

1. Commit is needed after one row is inserted.
2. Bulk commit may not update the index.
3. Index rebuild may not work.

**mike9** · 08-16-2007, 12:37 PM

Originally Posted by JChiappa

...
in this system (a large DW-one) disk space is at a premium, each and every "space economy" if possible/viable/not performance-impacting must be done....So the "overhead" is just more disk usage than the bare minumum, maybe "overhead" was a bad choice of words...

From my point of view it's, for a DWH system, not disk space which is a premium but the performance of the disk system.

Also I don't see the big space impact you would have creating a simple btree index over a date field (with probably less then 10000 distinct values). I think that even with 100M rows you should not use much more then 3GB.

An alternative, without using any additional disk space, would be to recreate the PK index like ( DATE_PROCESS + field1 + field2 + field4).

**JChiappa** · 08-17-2007, 05:55 PM

Ops, sorry for the long delay, people... Mike9, yes, a very viable alternative could be put the desired filed in the "point" of the index, getting an index range-scan, yes, BUT not possible, other queries neeed filed 1 as the first field in the index... And yes, we are talking about shave a few Gbs only, and like I said, yes, performance is the first point, BUT "space economy if possible/viable/not performance-impacting must be done", this is common-sense in this kind of env....Specially when (like we do) the disk space is LOCATED from others, each and every Gb used results in a bigger invoice to be paid... So, IF POSSIBLE, I must to do any savings, IF not interfering with performance. And this will be the point determining the inviability of the whole idea - I could not to create a FBI in my tests, but reading Tamil´s post I see the use of AUTONOMOUS TRANSACTION to do it, for sure this will impact enormously in my performance, will inviablize index updatings (will demands a full index REBUILD each time), so I´m just foregtting about the whole thing, no free lunch here, no "ultra-clever" ideas....
I will measure the consumption (and the time being spent in index refreshing after DMLs) with the compressed b*tree option and the Mv option, and choose the best one.

Regards to all,

Chiappa

Thread: Indexing only DISTINCT values

Thread Tools

Display

Indexing only DISTINCT values

Posting Permissions