
Thread: Insert many millions of lines coming from text files

  1. #1
    Join Date
    Sep 2000
    Location
    Sao Paulo,SP,Brazil, Earth, Milky Way
    Posts
    350

    Insert many millions of lines coming from text files

    I want to insert into a table data coming (daily) from several text files, each one with some millions of lines. I'm using Oracle EE 9.2.0.5, so I'm thinking along the lines of:

    a) load the text file (with sqlldr) into a common heap table and use the MERGE command against the destination table. Main problem here: I will need some extra working space inside the database to keep the staging table, and will need to truncate it after each load.

    b) mount the text file as an external table and use just one INSERT /*+ APPEND */ from the external table into the destination. Main problem here: the destination table HAS a primary key, and I need to keep a list of bad (duplicated/invalid) records; since the INSERT runs as a whole, any bad record will make the whole INSERT command fail.

    c) mount the text file as an external table and use MERGE against the destination table with WHEN NOT MATCHED THEN INSERT ... Main problem here: the WHEN MATCHED branch does not permit (AFAIK) inserting into another table - a "log error table".

    d) a PL/SQL program, reading in bulk from the external table and inserting with FORALL ... SAVE EXCEPTIONS (sketched below). It will (probably) do everything I want, but will require a non-trivial amount of work.
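
    For illustration, a minimal sketch of option (d), assuming hypothetical names: my_external_table for the external table, dest_table for the destination, dest_errors for the error log, and pk_col/val_col standing in for the real columns.
    Code:
    DECLARE
       TYPE t_pk  IS TABLE OF my_external_table.pk_col%TYPE;
       TYPE t_val IS TABLE OF my_external_table.val_col%TYPE;
       l_pk        t_pk;
       l_val       t_val;
       CURSOR c IS SELECT pk_col, val_col FROM my_external_table;
       bulk_errors EXCEPTION;
       PRAGMA EXCEPTION_INIT(bulk_errors, -24381);  -- ORA-24381: error(s) in array DML
    BEGIN
       OPEN c;
       LOOP
          FETCH c BULK COLLECT INTO l_pk, l_val LIMIT 10000;
          EXIT WHEN l_pk.COUNT = 0;
          BEGIN
             FORALL i IN 1 .. l_pk.COUNT SAVE EXCEPTIONS
                INSERT INTO dest_table (pk_col, val_col) VALUES (l_pk(i), l_val(i));
          EXCEPTION
             WHEN bulk_errors THEN
                -- rows rejected in this batch (e.g. PK violations) go to the error table
                FOR j IN 1 .. SQL%BULK_EXCEPTIONS.COUNT LOOP
                   INSERT INTO dest_errors (pk_col, val_col)
                   VALUES (l_pk(SQL%BULK_EXCEPTIONS(j).ERROR_INDEX),
                           l_val(SQL%BULK_EXCEPTIONS(j).ERROR_INDEX));
                END LOOP;
          END;
       END LOOP;
       CLOSE c;
       COMMIT;
    END;
    /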

    Any more ideas??

  2. #2
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    You might consider option (b) with the additional steps of ...
    i) disabling the PK
    ii) loading the data
    iii) enabling the PK using the EXCEPTIONS INTO clause to log PK violations into another table (sketched below) ... http://download-west.oracle.com/docs...3a.htm#1002528
    iv) Deleting/moving/logging whatever you want with the violating rows
    v) Re-enabling the PK
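
    For illustration, a rough sketch of those steps, assuming hypothetical names dest_table, dest_pk and dest_errors, plus the standard EXCEPTIONS table created by utlexcpt.sql:
    Code:
    -- one-off: create the standard EXCEPTIONS table
    @?/rdbms/admin/utlexcpt.sql

    ALTER TABLE dest_table DISABLE CONSTRAINT dest_pk;

    INSERT /*+ APPEND */ INTO dest_table SELECT * FROM my_external_table;
    COMMIT;

    -- fails if violations exist, but records the offending ROWIDs in EXCEPTIONS
    ALTER TABLE dest_table ENABLE CONSTRAINT dest_pk EXCEPTIONS INTO exceptions;

    -- simplest handling: move aside every row involved in a violation
    INSERT INTO dest_errors
       SELECT * FROM dest_table WHERE rowid IN (SELECT row_id FROM exceptions);
    DELETE FROM dest_table WHERE rowid IN (SELECT row_id FROM exceptions);

    ALTER TABLE dest_table ENABLE CONSTRAINT dest_pk;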
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

  3. #3
    Join Date
    Aug 2002
    Location
    Colorado Springs
    Posts
    5,253
    Ah, just thought of another possibility. You could use the external table method and query it with an analytic function that would indicate whether the row was a duplicate of another in the file ... for example if the pk columns were PK1 and PK2 then you could query with ...
    Code:
    Select
       Row_Number () Over
          (Partition By PK1, PK2 Order By PK1) PK_Occurrence,
       Count (*) Over
          (Partition By PK1, PK2) PK_Count,
       PK1,
       PK2,
       ...
    From
       my_external_table
    ... and use a multi-table insert to either ...
    i) Insert all rows contributing to a PK violation into an error table, or ...
    ii) Insert all-but-one rows contributing to a PK violation into an error table
    ... by using a WHEN clause based on PK_Occurrence or PK_Count in the above query.
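
    Roughly, taking variant (ii) and assuming hypothetical dest_table and dest_errors tables whose remaining columns are abbreviated to COL3 (I believe the analytic aliases can be referenced in the WHEN clauses):
    Code:
    INSERT ALL
       WHEN pk_occurrence = 1 THEN
          INTO dest_table  (pk1, pk2, col3) VALUES (pk1, pk2, col3)
       WHEN pk_occurrence > 1 THEN
          INTO dest_errors (pk1, pk2, col3) VALUES (pk1, pk2, col3)
    SELECT
       Row_Number () Over
          (Partition By PK1, PK2 Order By PK1) pk_occurrence,
       Count (*) Over
          (Partition By PK1, PK2) pk_count,
       PK1,
       PK2,
       COL3
    FROM
       my_external_table;
    Note that this only catches duplicates within the file itself, not clashes with rows already sitting in the destination table.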

    Well, I think it would work.
    David Aldridge,
    "The Oracle Sponge"

    Senior Manager, Business Intelligence Development
    XM Satellite Radio
    Washington, DC

    Oracle ACE

  4. #4
    Join Date
    Nov 2000
    Location
    greenwich.ct.us
    Posts
    9,092
    You might consider an INSTEAD-OF trigger (on a view over the destination table) and handle the logic of what should happen with the data once inside the trigger. This way, you mount your text file as an external table and INSERT from that table into the view. Here again, non-trivial PL/SQL work is involved.
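
    A minimal sketch of the idea, assuming hypothetical names dest_table, dest_errors and a load view dest_load_vw, and using DUP_VAL_ON_INDEX as a stand-in for "the logic":
    Code:
    CREATE OR REPLACE VIEW dest_load_vw AS SELECT * FROM dest_table;

    CREATE OR REPLACE TRIGGER dest_load_trg
    INSTEAD OF INSERT ON dest_load_vw
    FOR EACH ROW
    BEGIN
       INSERT INTO dest_table (col1, col2, col3)
       VALUES (:NEW.col1, :NEW.col2, :NEW.col3);
    EXCEPTION
       WHEN DUP_VAL_ON_INDEX THEN
          INSERT INTO dest_errors (col1, col2, col3)
          VALUES (:NEW.col1, :NEW.col2, :NEW.col3);
    END;
    /

    INSERT INTO dest_load_vw SELECT * FROM my_external_table;
    The trigger fires once per row, which is where the performance concern with these volumes comes in.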
    Jeff Hunter

  5. #5
    Join Date
    Mar 2002
    Posts
    534
    Is this PK a number which will/should always increase?

    If YES you might do the following:

    Create an external table (source)
    Find out the min and max value from the pk of the source
    Using these two values you could then do:

    insert /*+ append */ into target
    select s.*
    from source s,
         (select pk_id from target
           where pk_id between MIN_SOURCE_PK_ID and MAX_SOURCE_PK_ID) t_pk
    where s.pk_id = t_pk.pk_id(+)
      and t_pk.pk_id is null;

    commit;  -- needed before target can be read again after the direct-path insert

    insert /*+ append */ into duplicates
    select s.*
    from source s,
         (select pk_id from target
           where pk_id between MIN_SOURCE_PK_ID and MAX_SOURCE_PK_ID) t_pk
    where s.pk_id = t_pk.pk_id;

  6. #6
    Join Date
    Mar 2002
    Posts
    534
    Do you plan to partition your target table? If yes over which column?

  7. #7
    Join Date
    Sep 2000
    Location
    Sao Paulo,SP,Brazil, Earth, Milky Way
    Posts
    350
    Friends, first of all, thank you all for so many ideas... Let's go:

    mike, yes, it will be partitioned. In fact, the table has 5 columns, and the PK is made up of all 5 of them, so I'm thinking about making it an IOT - and no, none of the PK columns will be a sequence; I have a date, 2 varchars and 2 numbers, all of them user data, no artificial values here. Of the 5 columns, the first one (the date/time of processing) is the most significant (and the volume I will process each day is roughly the same), so this column could be a good partition key.
    I'm considering making it a range-partitioned IOT on this first column - of course, an IOT means some degradation on insert (since I will be inserting into an index, a complex structure), BUT I will save a whole lot of space: with dozens of millions of rows, an IOT needs only the index extents, while a heap table + PK index would use a lot of space for table blocks and another lot for index blocks.
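
    Something along these lines is what I have in mind (column names and partition bounds are hypothetical):
    Code:
    CREATE TABLE dest_table (
       call_date   DATE,
       col2        VARCHAR2(30),
       col3        VARCHAR2(30),
       col4        NUMBER,
       col5        NUMBER,
       CONSTRAINT dest_table_pk
          PRIMARY KEY (call_date, col2, col3, col4, col5)
    )
    ORGANIZATION INDEX
    PARTITION BY RANGE (call_date)
    (
       PARTITION p200409 VALUES LESS THAN (TO_DATE('2004-10-01','YYYY-MM-DD')),
       PARTITION p200410 VALUES LESS THAN (TO_DATE('2004-11-01','YYYY-MM-DD')),
       PARTITION pmax    VALUES LESS THAN (MAXVALUE)
    );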

    Jeff, I will test triggers, but besides requiring more work, I have some doubts about the performance of triggers with these volumes of data.


    Dave: yes, the EXCEPTIONS clause when enabling the PK is a close possibility, but I have never used it. My fear is: say today I have 20 million rows, validated. Tomorrow I disable the PK and load, say, 15 million more rows; when I re-enable the PK, will the database re-check/read the old 20 million PLUS the new 15 million?
    And about option c): am I really correct that MERGE is not capable of producing a list of exceptions in another table or file?

    regards,

    Chiappa

  8. #8
    Join Date
    Mar 2002
    Posts
    534
    What is the purpose of your PK?
    OR
    Why do you need a PK if you check during the data load that your new records are unique?

    Before using an IOT I would really test whether there is any benefit to it. Personally I think it is useful only if you access a few rows per query. So my question is: what kind of queries are going to be executed against this table?

    In case all your columns are NOT NULL you may do something like:
    INSERT ALL
      WHEN t_date_column IS NULL THEN
        INTO target     (date_column, col_2, col_3, col_4, col_5)
        VALUES (date_column, col_2, col_3, col_4, col_5)
      WHEN t_date_column IS NOT NULL THEN
        INTO duplicates (date_column, col_2, col_3, col_4, col_5)
        VALUES (date_column, col_2, col_3, col_4, col_5)
    SELECT s.date_column, s.col_2, s.col_3, s.col_4, s.col_5,
           t.date_column AS t_date_column
    FROM source s,
         (select * from target
           where date_column >= (select min(date_column) from source)) t
    WHERE s.date_column = t.date_column(+)
      AND s.col_2 = t.col_2(+)
      AND s.col_3 = t.col_3(+)
      AND s.col_4 = t.col_4(+)
      AND s.col_5 = t.col_5(+)

  9. #9
    Join Date
    Sep 2000
    Location
    Sao Paulo,SP,Brazil, Earth, Milky Way
    Posts
    350
    Hi, mike, the answers:

    > What is the purpose of your PK?
    >OR
    >Why do you need a PK if you check during the data load that your new records are unique?

    Well, the text files will be user data to be inserted directly into the data table (more specifically, phone call records) - 99.9% of the time the files will be OK, but there WILL be some (** rare **) cases of dups. The idea here is: in this situation, I think it WILL be more performant to just try the INSERT first (the PK will catch the dups), rather than, for each of the millions of lines, run a SELECT against the db table to check the data read. That's it.

    >> Before using an IOT I would really test whether there is any benefit to it. Personally I think it is useful only if you access a few rows per query. So my question is: what kind of queries are going to be executed against this table?

    Once loaded, the queries against the db table will typically access between 10% and 20% of the table - and the table will be partitioned (IOT or heap, it doesn't matter). The point here is: the table will be in the order of dozens of millions of rows, so 10% of this huge volume WILL be of some concern, I think. I must test, test and test; that is the only way to be sure about the exact degradation due to IOTs, afaik. What do you think?


    >> In case all your columns are NOT NULL you may do something like:
    >> INSERT ALL
    >>   WHEN t_date_column IS NULL THEN INTO target ...
    >>   WHEN t_date_column IS NOT NULL THEN INTO duplicates ...
    >> .....


    Hmm, I know (in PL/SQL):

    FORALL i IN 1 .. n SAVE EXCEPTIONS
      INSERT INTO dest_table ...

    but not this "INSERT ALL" syntax - is it new with Oracle 9? Is it plain SQL or PL/SQL?

    regards.

    Chiappa
