Joe Celko on IDENTITY

**chrisrlong** · 04-26-2001, 02:38 PM

Thought this would be interesting for this forum to discuss. While he is addressing SQLServer IDENTITY fields, you can easily replace IDENTITY with Trigger-Sequence-populated PK field. The crux of the argument lies in point 5, where he states:
"If you have designed your tables correctly, they will have a
meaningful primary key derived from the nature of the entity they model. "

For the record: I COMPLETELY AND TOTALLY DISAGREE WITH THIS STATEMENT. I cannot emphasize enough how whole-heartedly I support and promote the use of surrogate keys for all PKs!

However, I would like to hear what everyone else thinks about this:

>> How is the use of an IDENTITY column as a primary key not relational?
I'm
under the impression it is. <<

Real SQL programmers use real keys and do not try to imitate 1950's
magnetic
tape or punch card systems in a relational model. In the early days of
programming languages, we exposed a lot of the physical and hardware
implementation to the programmer. For example, COBOL and FORTRAN depend
on
physically contiguous storage of data, which made a ROWID (i.e. physical
location reference) possible.

Later, we designed languages to independent of any physical and hardware
implementation. Thus, an INTEGER datatype did not have to be eight
binary
bits in a two complement format. It just had to behave like an integer
in
the program and the programmer did not have to worry about how the
hardware
did its work -- or even know what the hardware was.

SQL and other modern programming languages carry this idea further and
try
to completely separate logical and physical implementations. The idea
is
that SQL is based on sets (which have no ordering) and a standard that
defines its logical behavior. The behavior has nothing to do with
whether a
product uses B-Tree indexes, bit vectors or hashing; two complement or
base
ten arithmetic; whether the host program calling the SQL is C or Cobol;
etc.

The IDENTITY column in SQL Server is an attempt to return to those
thrilling
days of yesteryear, and the conceptual junk left over from the days when
people did not know much about relational databases. What we knew was
sequential file systems -- punch cards and magnetic tapes. Even the
disk
file systems mimicked these systems, adding only simple indexes.

Sequence was a basic way of looking at data. People thought in terms of
them at a primitive level. A sequence of bits make a byte, a sequence
of
bytes make a field, a sequence of fields make a record and sequence of
records make a file. Very low level, very close to the machinery.

Then along comes the relational model. It is based on sets; a set is a
completed whole, without any ordering to it. No sequences! Very
abstract!
Programmers did not know how to cope, so the vendors exposed the
physical
implementation and called these things "features" and locked their
products
to particular architectures. I can go into details on that problem, but
let
me say that when we went to the bar after ANSI X3H2 meetings, the
vendors
griped about what they had to do to these extensions to preserve them in
the
next platform, how they could not scale up to data warehouse size
databases,
etc.

The IDENTITY column is one of these mistakes.

1) It is not part of the SQL-92 Standard and it is highly proprietary to
the
Sybase family. It is not portable -- not quite the same thing as
proprietary, since you can often translate one SQL dialect into another
with
a simple replacement (i.e. the % operator becomes the MOD() function).
So
your code will not move over to a new database.

2) IDENTITY looks like a datatype, but it is not. Create a table with
one
column in it and make it an IDENTITY column. Insert a number into the
table
and see what happens. Try to set it to NULL. If you cannot insert,
update
and delete all the columns, then this is not a table!

3) IDENTITY looks like a constraint, but it is not. Try to create a
table
with two IDENTITY columns and it fails. If you cannot add it to a
column,
then it is not a constraint. It is possible to write a a set of
constraints
that prohibit data from ever being put in the table (their predicate is
always FALSE). It is possible to write a a set of constraints that
allow
anything in the table (their predicate is always TRUE). But no
constraint
can prohibit the creation of the table itself -- that is a
meta-constraint.

4) It is not relational. Consider this statement on a table, Foo, which
has
an identity column. Assume the query returns more than one row.

INSERT INTO Foo (x)
SELECT a FROM Bar;

You will get a result like this:

IDENTITY X
============
1 'a'
2 'b'
3 'c'

but if the query changed an index or was put on the physical disk data
page
differently, you might have gotten:

IDENTITY X
============
1 'b'
2 'c'
3 'a'

Explain why one result is the logically correct choice for an identifier
and
all other choices are not, without any reference to the physical
implementation. You cannot.

Instead of treating the query as a set, you are doing 1950's sequential
processing using the underlying sequential file system the Sybase family
started with.

5) If you have designed your tables correctly, they will have a
meaningful
primary key derived from the nature of the entity they model. The
IDENTITY
column should be a redundant key. The reason IDENTITY columns are
popular
as keys is that they are easy to declare. This is also the same reason
that
people build non-normalized databases and put pennies in fuse boxes --
easy
is not right.

6) It is a bitch to do calculations on IDENTITY column values. Well, it
was
hard to do direct math on the sequential position of a record in a
1950's
punch card system and that it what the IDENTITY is mimicking.

7) There is no check digit in an IDENTITY columns value, so you have no
way
of verifying it if you use it as a key.

8) If you use IDENTITY as a key, the values tend to cluster on physical
data
pages because they are sequential. The result is that if the most
recent
rows are the most likely to be accessed, there will be locking
contention
for control of those physical data pages. What you really wanted in a
key
is some spread of the rows over physical storage to avoid having every
user
trying to get to the same page at the same time.

9) The actual implementation of the IDENTITY column has been problematic
since Version 7.0. You can look up threads in the news groups to get
assorted tales of woe.

--CELKO--

SQL guru at Trilogy
===========================

- Chris

**jmodic** · 04-26-2001, 02:56 PM

Originally posted by chrisrlong
For the record: I COMPLETELY AND TOTALLY DISAGREE WITH THIS STATEMENT. I cannot emphasize enough how whole-heartedly I support and promote the use of surrogate keys for all PKs!

However, I would like to hear what everyone else thinks about this:
..SNIP...

Not going into details about IDENTIRY in SQL Server as I don't have any exeriences with it, I strongly disagree with Celko (and agree with you) about the use of surrogat keys for PKs. Particulary in DW environment with dimension models (star- and snowflake designs), surrogat keys are a must. Ralph Kimbal as one of the most prominent expert for DW concepts doesn't even want to discuss anything other than meaningless numbers (surrogate keys) for PKs.

**marist89** · 04-26-2001, 03:10 PM

Originally posted by chrisrlong
Thought this would be interesting for this forum to discuss. While he is addressing SQLServer IDENTITY fields, you can easily replace IDENTITY with Trigger-Sequence-populated PK field. The crux of the argument lies in point 5, where he states:
"If you have designed your tables correctly, they will have a
meaningful primary key derived from the nature of the entity they model. "

For the record: I COMPLETELY AND TOTALLY DISAGREE WITH THIS STATEMENT. I cannot emphasize enough how whole-heartedly I support and promote the use of surrogate keys for all PKs!

Wow, that's a pretty strong statement. I agree that there are certain situations where a surrogate PK is necessary. However, there are other situations in which the use of a surrogate PK will hamper the physical implementation. For example, say you have a table that contains historical and is structured as:
trade_dt date,
exchange_cd varchar2(10),
symbol_cd varchar2(10),
price number(20,10)

The natural PK for this table would be trade_dt, exchange_cd, and symbol_cd.

As the DBA I would want to option to partition this table based on the date range. That way, I could move my "active" data to a faster tablespace and my "archive" data to a slower tablespace.

However, if you add a surrogate PK to the table and structure it as:
surrogate_id number(10),
trade_dt date,
exchange_cd varchar2(10),
symbol_cd varchar2(10),
price number(20,10)

partitioning based on date range becomes difficult. If you want to use local partitions, you have to estimate how many ID number will fall in the particular time period you are interested in partitioning.

You could continue to use a regular index and partition on date, but when you move a partition from one tablespace to another the index becomes unusable and you have to rebuild it.

I think in this situation, using an artificial PK hampers the design.

However, I would like to hear what everyone else thinks about this:

>> How is the use of an IDENTITY column as a primary key not relational?
I'm
under the impression it is. <<

Isn't one of the rules of data normalization that each attribute must contribute to the relation? (or something like that). I think his take is that since a child record doesn't hold meaning without it's parent record, why does it need an ID number? It's just extra "stuff" along for the ride.

That being said, I think it's perfectly reasonable to use sequential numbers for things like ID numbers, time-ordered events, etc.

**gsprince** · 04-26-2001, 03:40 PM

"Isn't one of the rules of data normalization that each attribute must contribute to the relation? (or something like that). I think his take is that since a child record doesn't hold meaning without it's parent record, why does it need an ID number"

I guess, in this case whenever child records are inserted, the IDENTITY (or PK) of the parent table is also carried over, to have the relation maintained. (here the parent key as such is not a part of the child's Primary Key)

Thus,

Table1:
----------
Tab1Id1 number, --PK (IDENTITY)
Tab1Comp1 number,
Tab1Comp2 varchae2(10),
Tab1Comp3 varchae2(10),
Col1 varcha2(10),
Col2 varcha2(10),
Constraint UQ_Tab1Com
Unique (Tab1Comp1, Tab1Comp2, Tab1Comp3)

Table2:
----------
Tab2Id1 number, --PK (IDENTITY)
Tab1Id1 number, --FK
Tab2Col1 varcha2(10),
Tab2Col2 varcha2(10)

It does make sense in some cases and may not at some other places as noted by Jeff.

**chrisrlong** · 04-26-2001, 03:49 PM

Actually, I know of no normalization rule that dis-allows the use of a surrogate primary key, but I'll have to think about that some more, as I am also a major proponent of normalization. I'd hate think I've been contradicting myself

.

And, yes, I guess I was a little over-emphatic in my support or surrogate keys, but not much. There are sooo many benefits to using surrogate keys, that one has to have a very strong argument to convince me not to use one.

And speaking of such an argument

, you may have one. Can you give me some more detail about the locally-partitioned index issue. Before I go out with a book preaching the gospel according to surrogate keys, I'd like to better understand any drawbacks

, and I hadn't really picked up on that one - partitioning is one of the things I have not seriously dug into yet. On the surface, however, it sounds like something with a significant-enough fallout that maybe Oracle is planning on changing it in 9i. Much like their lame notion of 'PL/SQL's SQL engine lags behind the base SQL engine' that is being 'fixed' in 9i.

Anyway, if you could explain that to me a little more or feed me a link, it would be greatly appreciated. Also, if there are any other drawbacks to surrogate keys that you can think of, please let me know.

Thanks,

- Chris

**marist89** · 04-26-2001, 04:13 PM

In general, you can look at:
http://technet.oracle.com/docs/produ...rti.htm#437783

Local indexes are discussed in:
http://technet.oracle.com/docs/produ...rtiti.htm#5979

Basically, when you partition a table by range, your table is conceptually split up into many different tables each containing a certain set of the data. A local index is an index that is partitioned as well using the same partition bounds as the base table. For example, take my table from before:
trade_dt date,
exchange_cd varchar2(10),
symbol_cd varchar2(10),
price number(20,10)

Say you wanted this table partitioned by month, you would create 12 partitions. A local index would also have 12 partitions each corresponding to the partition of the base table. (The table partition for JAN would have a corresponding partition for JAN). You essentially have 12 tables and 12 indexes that Oracle knows how to put back together.

The problem comes in when you don't partition on the PK. Say we take:
surrogate_id number(10),
trade_dt date,
exchange_cd varchar2(10),
symbol_cd varchar2(10),
price number(20,10)

and partition it on trade_dt by month. If the surrogate_id is the PK, it can NOT be a partitioned index because it is not part of the partition key. Therefore, if you modify any of the partitions, Oracle doesn't know how to tell the index what's going on and it flags the index as "UNUSABLE". You then have to go and rebuild the index.

**jmodic** · 04-26-2001, 04:56 PM

Jeff,

Your remarks about problems with rate partitioning are valid, but only up to some point. As allways, there are pros and cons about using real meaningfull values for PKs. You have to draw a line under all pluses and minuses and see what to use, all in the context of your bussines rules.

You say the things get complicated when you want to use date-range partitioning in your example with surrogat PK when you want to move partitions etc. True, but the partition-managing tasks are generally more an exception than a rule, so in those unavoidable exceptional situations DBA should be able to perform a global index rebuild as a part of a partition-managing task.

On the other hand, what if your bussines requires that sometime trade_dt date value can be changed (clerk who entered it initially made a mistake, he should be able to correct that mistake...)? In the lack of update-cascade functionality it sometimes gets so complicated to implement this. But if you have surrogate PK it is a simple update, nothing else.

Speaking of date datatype as a part of a primary key, it is sometimes unavoidable to use surrogate values instead of dates in PKs. A second as the lowest date granularity is simply not sufficient in many real life cases to relay on this datatype as a distinctive value for each record. If Oracle had the timestamp datatype it would be different, but then again - isn't the timestamp also a kind of surrogate value?

**chrisrlong** · 04-26-2001, 04:58 PM

First, thanks for the input, and I will absolutely look into the issue some more.

But back to the original topic:

We have a break-through!!!

A friend of mine sent a message to Mr. Celko asking him several questions about his stance on IDENTITY columns. After a lot of other feedback (including a massive dissertation on the makup of SSNs

), we have this nugget towards the end:

"A surrogate key is fine with me, but I don't think we mean the same thing by that term. A surrogate is NEVER exposed to the user at all, but is created by the database engine itself for speed -- a hash code for a long key, a pointer based on the current physical storage, etc.

An artificial key is exposed to the users and it is something that was made up as an identifier for our system. Let's say we have no industry standard customer number in our model, so we want to assign one to everybody we do business with. It will be exposed to the customer on all his paper work with us. So we want a check digit to be sure that we can verify it when it is keyed in. We would like it to be big enough to over
all possible customers we will ever have. This and a lot of other things are part of data design. "

Sooo, the universe once again makes sense. My head has been spinning the last couple hours trying to understand how one of my idols could have fallen so far

. It seems it was mostly a mis-undertanding. To translate, his idea of a surrogate key is the same as ours, or at least mine - it is *never* displayed to the user! What he has been railing against (besides the IDENTITY implementation) has been what he calls 'artificial' keys, which *are* displayed to the user - his example of Customer ID brings it into perfect relief. I also now understand what he means about 'verifiability' and check digits. And on top of that, I agree with much of what he is saying. It is, indeed, part of the data design process to determine how such artifical keys will be generated, provided and verified.

All is once again right with the world

- Chris

[Edited by chrisrlong on 04-26-2001 at 05:01 PM]

**Zaggy** · 04-26-2001, 06:01 PM

I use surrogates and natural keys, I pick them based on many different factors like:

1. How is the table going to be used (ie parent, child, lookup, etc..)
2. The extra, possibly uselesss, index created by a surrogate key (Becuase I have a system that inserts over 2 million rows an hour on a weak machine. I do not want the extra overhead of a potentialy useless index)
3. Possible duplicate entries
4. Denormalization
5. and the list goes on

Here is a page I found on the subject.

http://www.dorna-consultants.com/consulting.htm

Here is my two-cents on this subject.

Thread: Joe Celko on IDENTITY

Thread Tools

Display

Posting Permissions