We have a number of separate data feeds that all insert rows into a central table. It is quite possible (and legitimate) for these feeds to try to append rows that already exist in the table. If this happens we want to discard duplicates just keeping one copy of the row. We use SQL*Loader (sqlldr) to append rows into the table.
As a a quick and convenient way to solve this we put a unique key constraint across all the columns in the table and then configured the CTL file to allow a huge number of errors/discards so that the constraint vioations got ignored.
This works great until you try more than 10 or 20 rows! We're pushing in around 30K rows at a time and normal load times are about 7 or 8 seconds across a network, but when we load a batch full of duplicates (emulated by loading the same file twice) the load drops to about a minute and a half. Turning off errors (SILENT=(ERRORS)) takes about 10/15% off the time but its still really slooooow. Is there anything I can do on the DBMS table to help speed this up? FYI - I'm not interested in knowing which rows failed.
I would like to do in the following way:
1 Load all data into a temp table with direct option.
2 Create an index on the temp table. The index key must be equal to PK of the original table.
3 Delete duplicated rows in the temp table.
4 Insert into orginal table from temp.
What I think is unique constraint on the original table solws down the load.
I had implemented some time ago like this. I don't know if this works to you.
1. Create a duplicate table with nologging option.
2. Use direct load on to duplicate table
3. Use SQL Statement
insert into main_table
select * from duplicate_table
select * from main_table
Thanks for the suggestions but unfortunately it's not that simple! (is it ever?!) The duplicates I'm loading may duplicate with rows loaded yesterday or some other point in time - there are not necessarily duplicates in what i'm loading 'now' - but with what's already there.
These tables can grow to become signifcantly sized (25-30m rows, 12 columns of data) so I don't like the idea of doing a 'select *' on it anywhere!! (Our hardware isn't the best either!)
Where ur doing the data feed,first
create a temp table with a flag column
1.insert into a temp table.
2.check if there are any records already present in the main table.
3.if yes then
4.update flag column of the temp table to say 'D' ,for all the matching records found in the main table
5.then delete all the records from the temp table with flag column having value 'D'.
6.and then insert the remaning records from the temp table into the main table.
6.If there are no matching records then update the flag column as 'N'.Then insert all the records from temp table to main table with flag column having value 'N'.