-
Please help me speed this up...
I have a process that takes a clob and checks it against a table of words. It takes about 17 seconds per clob. I need to speed it up.
Here's the core of the code:
begin
  open c_unclean_text;
  loop
    fetch c_unclean_text into v_text_id, v_text;
    exit when c_unclean_text%NOTFOUND; -- exit right after the fetch, or the last row gets processed twice
    open c_words;
    loop
      fetch c_words into v_word;
      exit when c_words%NOTFOUND;
      if regexp_like(v_text, '["''*() .;:,^!-]'||v_word||'["''() .;:,$!-?*]', 'i') then
        v_wordlist := v_wordlist||v_word||' ';
      end if;
    end loop;
    close c_words;
    if v_wordlist is not null then
      insert into vtest (id, wordlist) values (v_text_id, v_wordlist);
      commit;
    end if;
    v_wordlist := '';
  end loop;
  close c_unclean_text;
end;
/
Some notes:
1. I used regexp_like() because LIKE would return "grASS" or "ASSociation". I just want delimited words.
2. What about using a pl/sql table for the word table?
3. The process works well, but it just takes too long. I may have 3k-5k of these a night; 3k takes about 13 hours to process.
4. I considered Oracle Text but I don't think it'll do regexp_like().
I hope you see something that can help.
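(Aside: the delimited-word idea in note 1 is easy to sanity-check outside the database. Here's a quick Python sketch; the sample text and word list are made up, and Python's \b word boundary stands in for the delimiter character classes in the Oracle pattern.)

```python
import re

def find_words(text, words):
    """Return the words that appear as whole, delimited words in text.

    This mimics the intent of the Oracle pattern (not its exact
    character classes): \b is a word boundary, so 'ass' will not
    match inside 'grASS' or 'ASSociation'.
    """
    found = []
    for w in words:
        if re.search(r'\b' + re.escape(w) + r'\b', text, re.IGNORECASE):
            found.append(w)
    return ' '.join(found)

print(find_words("The grASS near the ASSociation hall.", ["ass", "grass", "hall"]))
# prints "grass hall" -- 'ass' is filtered out because it is never a whole word
```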
-
Re: Please help me speed this up...
Originally posted by KenEwald
I hadn't heard of regexp_like; I haven't done much with 10g yet, so I had to look it up. As a PL/SQL programmer I would say you should use implicit cursors whenever possible, and this is such a case.
Second, have you looked at the waits generated while this program is running? Do you know how much temp space it is using? I was playing with clobs in 9i a few years ago and started to have problems with filling up the temp tablespace. Fetching a clob definitely adds overhead. Take a look at the DBMS_LOB package; there is some functionality that might be helpful, like DBMS_LOB.INSTR.
It has been a while since I looked at this, but I remember something about creating an object that points to the correct row in the table and using the DBMS_LOB procedures to process the data in place. A PL/SQL table might be helpful.
Also, are there any bug reports on this function as it relates to clobs?
Last edited by gandolf989; 10-18-2004 at 01:22 PM.
-
Thanks for the reply...
Finally if you are only storing 5k in the clob why don't you just make the column VARCHAR2(8000)?
The clobs could be very long. I was referring to 3,000 to 5,000 records a night.
I'll check into implicit cursors. Would that really help the performance?
I'm thinking the bulk of the work is the full table scan on the 5,177 row word table. For each Clob, I scan all 5,177 rows.
I'm guessing the biggest performance gain would be to put this in a memory table or array and walk through the array instead of the table. What do you think?
No problem with temp, I had 15 of these processes running at the same time on a 4 proc box to check 100,000+ clobs. It took about 2 days. Shaving a couple of seconds from each iteration would add up.
-
Can this not be turned into a straight insert? Select the text_id and the matching words by joining the two tables on the regexp_like() condition in a single SQL statement, then concatenate the matching words into a single record as a separate step afterwards.
-
Let's give it a try:
insert into vtest (id, wordlist)
select t.text_id, w.word
from text_table t,
words w
where regexp_like(t.text,'["''*() .;:,^!-]'||w.word||'["''() .;:,$!-?*]','i')
It's valid syntax, but it's not returning rows. Hmm...
text_table has 1 row: 1,MARTIN
words has a row where word = 'MARTIN'
It took 3 seconds .. guess that's because it created a cartesian product between the two tables.
Here's the explain plan...
Code:
ID  PID  Operation           Name        Rows  Bytes  Cost
 0       SELECT STATEMENT                 259    18K     6
 1    0  NESTED LOOPS                     259    18K     6
 2    1  TABLE ACCESS FULL   TEXT_TABLE     1     65     3
 3    1  TABLE ACCESS FULL   WORDS        259     1K     3
-
Added
where 1=1 and regexp_...
and it returns nothing in 0.5 seconds, with the same execution plan.
By the way... I rewrote it using implicit cursors and it dropped from 17 sec to 13.
I'm working on creating words as a collection.. that's got to be faster.
-
What about this?
Code:
DECLARE
   TYPE t_text_ids IS TABLE OF text_table.text_id%TYPE
      INDEX BY BINARY_INTEGER;
   TYPE t_words IS TABLE OF words.word%TYPE
      INDEX BY BINARY_INTEGER;
   v_text_ids   t_text_ids;
   v_words      t_words;
   v_text       CLOB;
   v_wordlist   VARCHAR2(8000);
BEGIN
   SELECT text_id
     BULK COLLECT INTO v_text_ids
     FROM text_table;
   SELECT word
     BULK COLLECT INTO v_words
     FROM words;
   FOR i IN v_text_ids.FIRST .. v_text_ids.LAST
   LOOP
      SELECT text
        INTO v_text
        FROM text_table
       WHERE text_id = v_text_ids(i);
      FOR j IN v_words.FIRST .. v_words.LAST
      LOOP
         IF regexp_like( v_text,
               '["''*() .;:,^!-]' || v_words(j) || '["''() .;:,$!-?*]', 'i' )
         THEN
            v_wordlist := v_wordlist || v_words(j) || ' ';
         END IF;
      END LOOP;
      IF v_wordlist IS NOT NULL
      THEN
         INSERT INTO vtest ( id, wordlist )
         VALUES ( v_text_ids(i), v_wordlist );
      END IF;
      v_wordlist := '';
   END LOOP;
   -- You should only need to commit once.
   COMMIT;
END;
/
-
After loading the cursors into collections and retesting, I ended up gaining 2 seconds per text search (17 down to 15 sec).
I took the regexp_like out to find out exactly how much time each part was consuming. It ran through 200 clobs in 4 seconds (looping through all 5,177 words in the collection each time).
Then I looked at the text clob and found that each one has about 950 trailing blank spaces. After trimming the blanks, we're down to ~4 seconds per clob. That's more like it.
This also means we can save about 80-100 MB/day of stored blanks.
Thanks for all your help on this thread.
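(To put numbers on the trailing-blank point, here's a tiny Python illustration. The ~950-blank figure comes from the post above; the sample text itself is made up.)

```python
# A clob-like string padded with ~950 trailing blanks, as described above.
text = "MARTIN was mentioned here." + " " * 950

trimmed = text.rstrip()

# Every regexp scan now covers ~26 characters instead of ~976,
# and the table stops storing the padding as well.
print(len(text), len(trimmed))
# prints "976 26"
```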
-
Would it help to clean up the expression with something like the following?
Code:
insert into vtest (id, wordlist)
select t.text_id, w.word
from text_table t,
words w
where regexp_like(t.text,'([^[:alpha:]]|^)'||w.word||'([^[:alpha:]]|$)','i')
I also wonder whether it would be worth "pre-parsing" the string with a cheap LIKE operation first, just to reduce the number of times you call REGEXP_LIKE ...
Code:
insert into vtest (id, wordlist)
select text_id, word
from (
select t.text_id, w.word, t.text
from text_table t,
words w
where Upper(t.text) like '%'||Upper(w.word)||'%'
and rownum >= 1
)
where regexp_like(text,'([^[:alpha:]]|^)'||word||'([^[:alpha:]]|$)','i')
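(The two-stage idea is easy to sanity-check outside the database. Here's a rough Python sketch of the logic only; it says nothing about what the Oracle optimizer will actually do, and the sample data is invented.)

```python
import re

def match_words(text, words):
    """Two-stage match: a cheap substring pre-filter (the moral
    equivalent of LIKE '%word%'), then the full delimited-word
    regex only for the survivors."""
    upper = text.upper()
    hits = []
    for w in words:
        if w.upper() not in upper:       # stage 1: cheap pre-filter
            continue
        # stage 2: word must be bounded by non-letters or string ends
        if re.search(r'(^|[^A-Za-z])' + re.escape(w) + r'([^A-Za-z]|$)',
                     text, re.IGNORECASE):
            hits.append(w)
    return hits

print(match_words("grass-fed beef", ["beef", "ass", "cow"]))
# prints "['beef']" -- 'ass' survives stage 1 but fails stage 2
```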
Last edited by slimdave; 10-15-2004 at 05:45 PM.
-
Perhaps you're right. What would the query look like to check 1,000 clobs against 5,177 words with a LIKE condition? We'd most likely get all of them anyway.
Thanks for all your smart thinking.