Joe Celko on IDENTITY

**chrisrlong** · 04-26-2001, 02:38 PM

Thought this would be interesting for this forum to discuss. While he is addressing SQLServer IDENTITY fields, you can easily replace IDENTITY with Trigger-Sequence-populated PK field. The crux of the argument lies in point 5, where he states:
"If you have designed your tables correctly, they will have a
meaningful primary key derived from the nature of the entity they model. "

For the record: I COMPLETELY AND TOTALLY DISAGREE WITH THIS STATEMENT. I cannot emphasize enough how whole-heartedly I support and promote the use of surrogate keys for all PKs!

However, I would like to hear what everyone else thinks about this:

>> How is the use of an IDENTITY column as a primary key not relational?
I'm
under the impression it is. <<

Real SQL programmers use real keys and do not try to imitate 1950's
magnetic
tape or punch card systems in a relational model. In the early days of
programming languages, we exposed a lot of the physical and hardware
implementation to the programmer. For example, COBOL and FORTRAN depend
on
physically contiguous storage of data, which made a ROWID (i.e. physical
location reference) possible.

Later, we designed languages to independent of any physical and hardware
implementation. Thus, an INTEGER datatype did not have to be eight
binary
bits in a two complement format. It just had to behave like an integer
in
the program and the programmer did not have to worry about how the
hardware
did its work -- or even know what the hardware was.

SQL and other modern programming languages carry this idea further and
try
to completely separate logical and physical implementations. The idea
is
that SQL is based on sets (which have no ordering) and a standard that
defines its logical behavior. The behavior has nothing to do with
whether a
product uses B-Tree indexes, bit vectors or hashing; two complement or
base
ten arithmetic; whether the host program calling the SQL is C or Cobol;
etc.

The IDENTITY column in SQL Server is an attempt to return to those
thrilling
days of yesteryear, and the conceptual junk left over from the days when
people did not know much about relational databases. What we knew was
sequential file systems -- punch cards and magnetic tapes. Even the
disk
file systems mimicked these systems, adding only simple indexes.

Sequence was a basic way of looking at data. People thought in terms of
them at a primitive level. A sequence of bits make a byte, a sequence
of
bytes make a field, a sequence of fields make a record and sequence of
records make a file. Very low level, very close to the machinery.

Then along comes the relational model. It is based on sets; a set is a
completed whole, without any ordering to it. No sequences! Very
abstract!
Programmers did not know how to cope, so the vendors exposed the
physical
implementation and called these things "features" and locked their
products
to particular architectures. I can go into details on that problem, but
let
me say that when we went to the bar after ANSI X3H2 meetings, the
vendors
griped about what they had to do to these extensions to preserve them in
the
next platform, how they could not scale up to data warehouse size
databases,
etc.

The IDENTITY column is one of these mistakes.

1) It is not part of the SQL-92 Standard and it is highly proprietary to
the
Sybase family. It is not portable -- not quite the same thing as
proprietary, since you can often translate one SQL dialect into another
with
a simple replacement (i.e. the % operator becomes the MOD() function).
So
your code will not move over to a new database.

2) IDENTITY looks like a datatype, but it is not. Create a table with
one
column in it and make it an IDENTITY column. Insert a number into the
table
and see what happens. Try to set it to NULL. If you cannot insert,
update
and delete all the columns, then this is not a table!

3) IDENTITY looks like a constraint, but it is not. Try to create a
table
with two IDENTITY columns and it fails. If you cannot add it to a
column,
then it is not a constraint. It is possible to write a a set of
constraints
that prohibit data from ever being put in the table (their predicate is
always FALSE). It is possible to write a a set of constraints that
allow
anything in the table (their predicate is always TRUE). But no
constraint
can prohibit the creation of the table itself -- that is a
meta-constraint.

4) It is not relational. Consider this statement on a table, Foo, which
has
an identity column. Assume the query returns more than one row.

INSERT INTO Foo (x)
SELECT a FROM Bar;

You will get a result like this:

IDENTITY X
============
1 'a'
2 'b'
3 'c'

but if the query changed an index or was put on the physical disk data
page
differently, you might have gotten:

IDENTITY X
============
1 'b'
2 'c'
3 'a'

Explain why one result is the logically correct choice for an identifier
and
all other choices are not, without any reference to the physical
implementation. You cannot.

Instead of treating the query as a set, you are doing 1950's sequential
processing using the underlying sequential file system the Sybase family
started with.

5) If you have designed your tables correctly, they will have a
meaningful
primary key derived from the nature of the entity they model. The
IDENTITY
column should be a redundant key. The reason IDENTITY columns are
popular
as keys is that they are easy to declare. This is also the same reason
that
people build non-normalized databases and put pennies in fuse boxes --
easy
is not right.

6) It is a bitch to do calculations on IDENTITY column values. Well, it
was
hard to do direct math on the sequential position of a record in a
1950's
punch card system and that it what the IDENTITY is mimicking.

7) There is no check digit in an IDENTITY columns value, so you have no
way
of verifying it if you use it as a key.

8) If you use IDENTITY as a key, the values tend to cluster on physical
data
pages because they are sequential. The result is that if the most
recent
rows are the most likely to be accessed, there will be locking
contention
for control of those physical data pages. What you really wanted in a
key
is some spread of the rows over physical storage to avoid having every
user
trying to get to the same page at the same time.

9) The actual implementation of the IDENTITY column has been problematic
since Version 7.0. You can look up threads in the news groups to get
assorted tales of woe.

--CELKO--

SQL guru at Trilogy
===========================

- Chris

Thread: Joe Celko on IDENTITY

Thread Tools

Display

Threaded View

Posting Permissions