Searching for text within a database can be like searching for the proverbial needle in a haystack. Users depend on the database administrator to solve complex text search requirements. Do you respond with clever methods such as LIKE, regular expressions and function-based indexes? Is there something more available to you and your users?
As a database user, searching
for text within a database can be like searching for the proverbial needle in a
haystack. As a DBA, you are more than likely going to be the primary resource
in helping a user solve a complex text search requirement. Are you left with
clever ways to use LIKE, regular expressions and function-based indexes, or is
there something more available to you and your users?
In prior articles (Searching and
Sorting Strings in Oracle, Sorting Strings
in Oracle and Multilingual
linguistic searching and sorting in Oracle), I covered linguistic
searching and sorting, well, at least how Oracle does it anyway. The search for
strings can be trivial, but at the same time, the importance of finding a
particular string - or something close enough to it - can have extreme
importance. We've all read or heard about the horror stories of airline
travelers who are mistakenly placed on the no-fly list. On the other hand, and
luckily enough so far, the consequences of a mistake the other way (someone
getting on board who shouldn't have been able to in the first place) haven't
been too severe. Of course, this doesn't address the issue of when the
information is available, but isn't checked, or not checked often enough.
Typical searches for strings
are against one column within one table in a schema. We also generally know
what we are looking for, so if we're not able to code an exact match in a WHERE
clause, we can get lucky to some degree by using the LIKE condition or
operator. In fact, when using the LIKE condition, did you know there is more
than one LIKE-like operator? Oracle allows the use of LIKEC, LIKE2 and LIKE4,
with their meanings described in the SQL Reference Guide. (10g version here).
What LIKE offers us is a
near-match exact hit based on the operand (what we're searching for). With the
wildcard (%), we can find matches containing or including the base string. The
example below is pretty clear-cut in terms of searching for a name containing
the string AK.
SQL> select ename
2 from emp
3 where ename like '%AK%';
ENAME
----------
BLAKE
Just out of curiosity, is the
following query equivalent?
SQL> select ename
2 from emp
3 where '%AK%' like ename;
The answer is no, because the
operand for LIKE has to follow LIKE; it cannot come before.
SQL> select ename
2 from emp
3 where '%AK%' like ename;
no rows selected
In these examples, we know
where to look, and maybe aren't quite sure what we're looking for, but we're
pretty close (near match) to begin with. What happens when the search is based
on strings where the parts are not "close enough" together, or where
we are looking for keywords out of a larger set of data? This is where Oracle
Text comes into play.
In Oracle8, Oracle Text
appeared as a data cartridge and was named Oracle ConText. In 8i days,
it was renamed to Oracle interMedia Text, and since Oracle9i and on, it
has been know as Oracle Text. One improvement in this feature is that Oracle
Text is now included with the RDBMS software installation and does not require
any additional licensing or module-like plug-ins. Throughout its history, the
schema associated with Oracle Text has been CTXSYS. One change is that CTXSYS's
objects, like many other utility types of accounts, are in the SYSAUX
tablespace.
You've seen other SYS-like
accounts during installation of Oracle, many of which have no use or relevance
for most database-driven applications. If you need to expand text searching,
including crawling a web page for keywords or examining an XML document, Oracle
Text is exactly what you need. Understanding how and where Oracle Text fits
into the big scheme of things requires a sampling of some white papers
available at Oracle
Technology Network (and yes, you'll see references to older versions of Oracle even
though Text appears under the 11g banner).
In Oracle's documentation, Oracle
Text is described as the following.
Oracle Text is a technology
that enables you to build text query applications and document classification
applications. Oracle Text provides indexing, word and theme searching, and
viewing capabilities for text.
This description does a
disservice to what Text can do, as a white paper
description shows.
Oracle Text uses standard SQL
to index, search, and analyze text and documents stored in the Oracle database,
in files, and on the Web. Oracle Text can perform linguistic analysis on
documents; search text using a variety of strategies including keyword searching,
contextual queries, Boolean operations, pattern matching, mixed thematic
queries, HTML/XML section searching, etc. Oracle Text excels at mixed queries,
i.e. those that involve structured relational attributes as well as text.
Do you subscribe to newsletters
or RSS feeds? How does the application know when to send you notification of an
event? In a financial setting, you don't want to be inundated with every
finance-related news item, but rather only receive notifications of interest.
If your interest happens to be Oracle and phrases containing "market
share," then a rule-based decision can be made to send you news when
Oracle and market share appear. The architecture of
such a system is shown below.
So, what are some differences
between Oracle Text and "normal" SQL queries? Not a whole lot,
really. As with many statements, the object type, which helps to speed up queries,
is our friendly index. Just as in regular DML and queries, indexes help to
speed up searches in Oracle Text. The type of index and how they are
used/created is a bit different, but once you understand some of the
terminology in Text, their meaning becomes clearer.
One major variation between
normal indexes and the indexes used by Oracle Text is how or when an index is
updated. In normal (regular SQL) DML, an index is updated upon insert into a
table (obviously assuming the column of interest is indexed). In Oracle Text,
this is not always the case, and it is up to you to perform a synchronization.
Statement syntax is also another significant variation. The good news is that
the output looks like the output from your run of the mill SELECT statement.
The four index types (generally
domain indexes for that matter) and their query operators are:
- CONTEXT,
using CONTAINS
- CTXCAT,
using CATSEARCH
- CTXRULE,
using MATCHES
- CTXPATH,
using existsNode().
Some of the indexes also use
parameters, and those will be examined in subsequent articles. The query
operator names are somewhat intuitive in how they support the index type. In a
context search (based on large coherent documents), we want to know if the text
contains what we're looking for. For many, but smaller documents (also of
various types), we have to search through a catalog, so we perform a catalog
search. In most contexts, we know that rules require a match, and one thing you
can't escape in XML is searching a path to see if something exists (at a node).
In Closing
Overall, the idea of Oracle
Text is simple, but some of its use and implementation can be complex, but like
many other topics in Oracle, once you see some working examples, the complexity
generally tends to go away.
In subsequent articles, we'll
cover the basic setup of using Oracle Text and examine examples shown in the
documentation and some of the sample code available at OTN.
Additional Resources
Oracle Text
Next
Back to DBAsupport.com