When it comes to searching, how often do you find yourself
using the LIKE (including NOT LIKE) operator in SQL statements? Other well-known
tests include equality and comparison (greater or less than), and this includes
not only strings, but also numbers. Conceptually, the equality/comparison tests
are pretty simple, and indexes can add a tremendous boost to performance when
these tests are used. However, what takes place when we’re not looking for an exact
match, but something, well, “LIKE” the value of interest? Mathematically, or at
least algorithm-wise, constructing a test for equality is pretty simple. How do
you suppose Oracle approaches the problem of determining if the string ‘ABCD’
appears in the string ‘ABCABDABCDAB’ and if it does, how many times (as in the
use of INSTR)? Further, once we have a result set, what determines the sorted
order?
The focus of Part 1 is on searching strings. In Part 2, I’ll
discuss sorting strings.
One thing that stands out in string searching is relevance.
When testing equality or for a range, the values which are returned can be
ranked or ordered. For X>500, a returned element of 1000 probably has much
more significance than an element valued at 501. For strings, it is difficult
to assign any meaning to Johns versus Johnson when searching for names like
John. I may be interested in the first and you may be interested in the second
name. There will only be relevance to the user at that time, but for someone
else, the opposite may be true. For a scalar, 501 will always be greater than
500, so the interpretation is always the same.
From an algorithm design standpoint, we want the process to
be performant. The complexity of the
algorithm should be first order, as in O(n) (the big “O” notation) as opposed
to O(n2), nLogO(n) and so on. This implies that the algorithm is
efficient, and efficiency can be qualified with goals of not repeating work,
skipping over intervals, and be deterministic (we can compute the best case and
worst case).
In addition to string searching within SQL, virtually every
text editor incorporates a search function. Without knowing the implementation
details or underlying code, it is hard to determine exactly how a search editor
works. Like most algorithms, there will be cases where some general or even
trivial condition will make the algorithm scream along or slow to a crawl.
Searching for a word within a string, where the word is longer than the
searched string, is an example of a case where adding a simple length check up
front will eliminate at least one condition. On the other hand, searching for
something not very distinguishable (but distinguishable enough because the
pattern does match) can evoke a worst-case performance (search for A in the
string BBBBBB…BBBA, no hit until the very end). Overall, the search should be
done in no worse than what is called linear time, or O(n).
So, from which end of the target string should the search
start? One widely used algorithm, because of its efficiency, is the Boyer-Moore
string search algorithm. This algorithm actually starts at the end and works
backwards. Once a miss is encountered (a “bad” character, that is, one that is
not in the search pattern/string), the algorithm knows to shift the search over
by the length of the search. If the bad character falls anywhere within the
positions covered by the length, then it is impossible to find a match within
that interval as the search string would span the bad character.
Now that we know the general approach of this algorithm, how
do you think it applies to Oracle? If the test operator is LIKE, perhaps Oracle
uses Boyer-Moore because we, and subsequently Oracle, don’t care where the
pattern is met or how many times it is met, just that it is met at least once.
On the other hand, if using INSTR, we have a choice as to where to start the
search (beginning or end), in addition to which occurrence. The fact that this
function defaults to the beginning of the target should not imply anything
about how Oracle conducts the search. For all we know, Oracle may internally
reverse the string for search purposes (which can be quickly done) and leave
the default start position as the beginning because this is more intuitively
obvious to humans (not implying Oracle is alien, just that in English, we tend
to start from left to right).
The example for the INSTR function in the SQL Reference
guide is shown below, followed by a similar query where I let the default start
position and occurrence both go to 1.
SQL> SELECT INSTR('CORPORATE FLOOR','OR', 3, 2)
2 "Instring" FROM DUAL;
Instring
----------
14
SQL> ed
Wrote file afiedt.buf
1 SELECT INSTR('CORPORATE FLOOR','OR')
2* "Instring" FROM DUAL
SQL> /
Instring
----------
2
This would suggest that defaulting to 1,1 for start and
occurrence AND not caring about the position returned by the function is
equivalent to using the pattern matching operation of LIKE.
What may not be obvious at this point is this fact: the
longer the search string/pattern, the quicker the search is likely to be performed.
If you divide a searched word (or string) of length N by the length M of the
search pattern, you have at most N/M chunks to search. The more information,
that is, the longer the search pattern is, you can provide up front, the less
work the search algorithm has to perform. This is akin to searching for a
needle in a haystack. It should be obvious that the longer the needle, the
easier it will be to find. The same holds true for string searches.
However, does this hold true in Oracle? A simple test
suggests that Boyer-Moore may not be the algorithm used for string searches.
set serveroutput on
declare
l_start number;
v_count number;
cursor c is
select last_name from hr.employees;
begin
execute immediate 'alter system flush shared_pool';
execute immediate 'alter system flush buffer_cache';
l_start := dbms_utility.get_time;
for r in c loop
v_count := instr(r.last_name,'x');
end loop;
dbms_output.put_line('One char time = '||
to_char(dbms_utility.get_time - l_start));
execute immediate 'alter system flush shared_pool';
execute immediate 'alter system flush buffer_cache';
l_start := dbms_utility.get_time;
for r in c loop
v_count := instr(r.last_name,'xxxxxxxxxx');
end loop;
dbms_output.put_line('Multi char time = '||
to_char(dbms_utility.get_time - l_start));
end;
/
One char time = 12
Multi char time = 62
PL/SQL procedure successfully completed.
Repeated numerous times, and all else being equal, the time
ratio stays at least 1 to 4 for the single character versus longer string
search, regardless if the search starts at the beginning or end. I would
suspect two things about Oracle’s string searching algorithm in support of ANSI
SQL: there may not be a length check performed beforehand (it’s extra work, so
reasonable to assume this), and the start end does matter.
In Closing
The string searches mentioned so far are based on pattern
matching. Another way to classify this type of search is hit or miss. What
about “close,” as in “close only counts in hand grenades and horseshoes?”
Oracle Text, in fact, considers “close” or fuzzy to be a type of search.
Queries are based on CONTAINS and documents that match are returned in a hitlist.
Our searching via SQL can emulate fuzzy searching to some degree by use of
wildcard expansion, but that’s for another article.
Next
Back to DBAsupport.com